Project Objective, Overview & Research¶
Context:¶
Australia's climate has warmed since 1910 [1]. We can use environmental sensor data (soil, CO2, temperature, and ambient air quality) to predict impacts on the rate of population growth, economic growth, and technological change. As Science Matters (2022) notes, a '...spatially comprehensive map[ing] of precipitation for the entire Australian continent is crucial for research and policy making in the face of changing climate extremes and variability' [2].¶
Eco-Metrix: Urban Environmental Insight System explores various environmental datasets from the City of Melbourne. It offers a unique opportunity to work with datasets containing a limited number of records, correlated with much larger datasets (2 million records) within a 2 km radius of samples, creating a strong use case for predictive (linear regression) modelling based on time series [4], [5], [6].¶
Objective: The goal is to integrate these datasets to monitor, analyse, and predict environmental conditions in a specific geolocation, aiding city planning and public safety. The datasets use various metrics, such as soil moisture, temperature, salinity, ambient air quality, and CO2 levels, which are crucial for understanding and managing the city's environmental health. This project was previously known as Soil Analyses.¶
User Stories:¶
As a city planner, I need to monitor environmental sensor data to quickly identify and respond to hazardous conditions, ensuring public safety.
As a resident of the city, I am interested in real-time environmental conditions around me, especially air quality and green space health, to plan my outdoor activities.
Deliverables:¶
- Data Preprocessing:
- Combine compatible datasets (data from sensors).
- Clean (normalise) the data.
- Integrated Data Dashboard: Create a dashboard displaying data from all sensors, including soil conditions, microclimate, and CO2 levels. Visualizations include maps, charts, and graphs.
- Predictive Analytics Model: Develop a model to predict environmental conditions and alert on anomalies or hazardous conditions based on historical and current sensor data.
- Data Quality Report: Generate a report detailing data completeness, accuracy, and anomalies across datasets, highlighting any gaps or inconsistencies.
- Documentation: Detailed guide on accessing, filtering, and interpreting the data, including how to use the API to download specific sensor readings.
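The API workflow described in the Documentation deliverable can be sketched as below, using the same Open Data export endpoint this notebook calls later. `'YOUR_API_KEY'` is a placeholder, and `fetch_csv`/`export_url` are illustrative helper names, not part of the API itself:

```python
import pandas as pd
import requests
from io import StringIO

BASE_URL = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'

def export_url(dataset_id, base_url=BASE_URL):
    """Build the CSV export URL for one dataset."""
    return f'{base_url}{dataset_id}/exports/csv'

def fetch_csv(dataset_id, api_key):
    """Download one dataset as semicolon-delimited CSV and return a DataFrame."""
    params = {'limit': -1, 'timezone': 'UTC', 'api_key': api_key}
    response = requests.get(export_url(dataset_id), params=params)
    response.raise_for_status()  # fail loudly on HTTP errors
    return pd.read_csv(StringIO(response.text), delimiter=';')

# e.g. df = fetch_csv('soil-sensor-locations', 'YOUR_API_KEY')
```

Filtering to specific sensor readings can then be done on the returned DataFrame (e.g. by `site_id` or time range).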
References:¶
[1] NSW and Australian Regional Climate Modelling (NARCliM) project, 2024, https://www.climatechange.environment.nsw.gov.au/evidence-climate-change/australian-climate-change-observations
[2] Science Matters (2022), University of Melbourne, https://pursuit.unimelb.edu.au/articles/the-satellite-data-mapping-australia-s-new-climate-extremes
[3] S. Kumar, 2022, Towards Data Science, https://towardsdatascience.com/4-techniques-to-handle-missing-values-in-time-series-data-c3568589b5a8
[4] N. Rogge, 2022, Hugging Face, https://huggingface.co/blog/time-series-transformers
Part 1 Data Preprocessing:¶
- Combine compatible datasets (data from sensors).
- Clean (normalise) the data.
Part 2 Integrated Data Dashboard: Create a dashboard displaying data from all sensors, including soil conditions, microclimate, and CO2 levels. Visualizations include maps, charts, and graphs.¶
Part 3 Predictive Analytics Model: Develop a model to predict environmental conditions and alert for anomalies or hazardous conditions based on historical and current sensor data.¶
Part 4 Data Quality Report: Conclusion and guidance on results and interpreting the data.¶
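As a minimal sketch of the kind of model Part 3 targets, a RandomForestRegressor can be fitted on lagged readings of a time series. The series, lag features, and parameters below are synthetic and illustrative, not the project's actual model or data:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic soil-moisture-like series with a daily cycle (illustrative only)
rng = np.random.default_rng(42)
n = 500
hours = np.arange(n)
series = 50 + 5 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.5, n)

# Lag features: predict the next reading from the previous three
df = pd.DataFrame({'y': series})
for lag in (1, 2, 3):
    df[f'lag_{lag}'] = df['y'].shift(lag)
df = df.dropna()

X, y = df[['lag_1', 'lag_2', 'lag_3']], df['y']
# shuffle=False keeps the chronological split, as time series require
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))
print(f"R^2 on held-out data: {r2:.3f}")
```

Anomaly alerts could then be raised when an observed reading falls far outside the model's prediction, though the threshold choice is application-specific.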
Set Up¶
# Dependencies
import warnings
import pandas as pd
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)
import requests
import numpy as np
from io import StringIO
# Optional code
###########################################################
# Show max width of columns in Google Colab ->
# pd.set_option('display.max_colwidth', None)
###########################################################
# Show min width of columns in Google Colab ->
pd.reset_option('display.max_colwidth')
###########################################################
!pip install tqdm # Progress bar
!pip install osmnx
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (4.66.2)
Collecting osmnx
  Downloading osmnx-1.9.2-py3-none-any.whl (107 kB)
Installing collected packages: osmnx
Successfully installed osmnx-1.9.2
import matplotlib.pyplot as plt
import seaborn as sns
import math
import osmnx as ox
import geopandas as gpd
import re
import networkx as nx
import folium
from folium import plugins
from folium.plugins import HeatMap
from geopy.geocoders import Nominatim
from tqdm.auto import tqdm
import ipywidgets as widgets
from ipywidgets import interact
from matplotlib.ticker import MaxNLocator
from IPython.display import display
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from scipy.spatial import cKDTree
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# API Key
from google.colab import drive
drive.mount('/content/drive')
# Read the API key from a file on Drive (avoids hard-coding secrets in the notebook)
with open('/content/drive/My Drive/SIT378/h.txt', 'r') as file:
    api_key = file.read().strip()
Mounted at /content/drive
# Define the company colors
color_d = ['#08af64', '#14a38e', '#0f9295', '#056b8a', '#121212'] #Dark theme
color_l = ['#2af598', '#22e4ac', '#1bd7bb', '#14c9cb', '#0fbed8', '#08b3e5'] #Light theme
A. Download datasets (or skip to B. if already downloaded)¶
Loaded datasets into dictionary:¶
- historical_soil_sensor = datasets['soil-sensor-readings-historical-data']
- soil_sensor_locations = datasets['soil-sensor-locations']
- microclimate_sensor_readings = datasets['microclimate-sensor-readings']
- microlabs_co2_sensor_data = datasets['microlabs-co2-sensor-data']
# Download datasets
def download_dataset(api_key, dataset_id, base_url='https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'):
    export_format = 'csv'
    url = f'{base_url}{dataset_id}/exports/{export_format}'
    params = {
        'select': '*',
        'limit': -1,  # all records
        'lang': 'en',
        'timezone': 'UTC',
        'api_key': api_key
    }
    with requests.get(url, params=params, stream=True) as response:
        if response.status_code == 200:
            total_size = int(response.headers.get('content-length', 0))
            chunk_size = 1024  # 1KB per chunk
            progress_bar = tqdm(total=total_size, unit='iB', unit_scale=True, desc=f"Downloading {dataset_id}")
            content = bytearray()
            for chunk in response.iter_content(chunk_size=chunk_size):
                if chunk:  # filter out keep-alive chunks
                    content.extend(chunk)
                    progress_bar.update(len(chunk))
            progress_bar.close()
            data = pd.read_csv(StringIO(content.decode('utf-8')), delimiter=';')
            return data
        else:
            print(f'Request failed with status code {response.status_code}')
            return None
# Dataset IDs
dataset_ids = [
    'microclimate-sensor-locations',
    'soil-sensor-readings-historical-data',
    'soil-sensor-locations',
    'microclimate-sensor-readings',
    'microlabs-co2-sensor-data'
]

# Initialize dictionary to hold the datasets
datasets = {}

# Download each dataset with a progress bar
for dataset_id in dataset_ids:
    datasets[dataset_id] = download_dataset(api_key, dataset_id)
    if datasets[dataset_id] is not None:
        print(f"{dataset_id} downloaded successfully.")
Downloading microclimate-sensor-locations: 0.00iB [00:00, ?iB/s]
microclimate-sensor-locations downloaded successfully.
Downloading soil-sensor-readings-historical-data: 0.00iB [00:00, ?iB/s]
soil-sensor-readings-historical-data downloaded successfully.
Downloading soil-sensor-locations: 0.00iB [00:00, ?iB/s]
soil-sensor-locations downloaded successfully.
Downloading microclimate-sensor-readings: 0.00iB [00:00, ?iB/s]
microclimate-sensor-readings downloaded successfully.
Downloading microlabs-co2-sensor-data: 0.00iB [00:00, ?iB/s]
microlabs-co2-sensor-data downloaded successfully.
View Dataset¶
- Sample records
# soil sensor readings historical data
print(datasets['soil-sensor-readings-historical-data'].head())
                  local_time            site_name  site_id        id  \
0  2023-12-12T11:00:00+00:00  Princes Park Lawn 5    64993  20008264
1  2023-12-12T10:15:00+00:00              Block 2   193854  20008441
2  2023-12-12T10:30:00+00:00              Block 2   193854  20008455
3  2023-12-12T11:30:00+00:00              Block 2   193854  20008550
4  2023-12-12T13:00:00+00:00  Princes Park Lawn 5    64993  20008650

   probe_id          probe_measure  soil_value  unit json_featuretype
0   1303872  Soil Moisture 70cm #0       58.72  %VWC           Output
1   2442542         #6 Temperature       24.70    ºC           Output
2   2442542         #6 Temperature       24.70    ºC           Output
3   2442542         #6 Temperature       24.60    ºC           Output
4   1303872  Soil Moisture 70cm #0       58.76  %VWC           Output
# soil sensor locations
print(datasets['soil-sensor-locations'].head())
site_id site_name property_name latitude \
0 51267 Fitzroy East Water Meter Fitzroy Gardens -37.810691
1 51268 Fitzroy West Water Meter Fitzroy Gardens -37.810532
2 51809 Melbourne (Olympic Park) BOM WS Olympic Park -37.825573
3 64970 Princes Park Oval 2 Princes Park -37.786400
4 64975 McAlister Oval Royal Park -37.778805
longitude location
0 144.981780 -37.81069103239338, 144.98178015514114
1 144.979830 -37.81053171390928, 144.97983004297106
2 144.981587 -37.82557277749247, 144.98158738017247
3 144.962590 -37.78639999998354, 144.9625900000018
4 144.956444 -37.77880526150242, 144.95644368244822
# microclimate sensor readings
print(datasets['microclimate-sensor-readings'].head())
local_time id site_id sensor_id value \
0 2023-11-29T22:15:02+00:00 8110553 1011 5b 74.38
1 2023-11-29T22:15:02+00:00 8110558 1011 5c 1010.80
2 2023-11-29T22:15:03+00:00 8110562 1011 6 0.00
3 2023-11-29T22:15:04+00:00 8110606 1012 6 0.00
4 2023-11-29T22:15:11+00:00 8110489 1009 0a 5.30
type units gatewayhub_id site_status
0 TPH.RH % arc1055 C
1 TPH.PRESSURE hPa arc1055 C
2 WS km/h arc1055 C
3 WS km/h arc1048 C
4 PM2.5 ug/m^3 arc1050 C
# microlabs CO2 sensor data
print(datasets['microlabs-co2-sensor-data'].head())
         dev_id sensor_name                       time  temperature  humidity  \
0  ers-co2-9da0     ERS CO2  2022-05-04T01:40:25+00:00         20.9        47
1  ers-co2-9da0         NaN  2022-05-04T01:33:25+00:00         20.9        47
2  ers-co2-9da0         NaN  2022-05-04T01:34:35+00:00         20.9        47
3  ers-co2-9da0         NaN  2022-05-04T01:31:05+00:00         20.9        47
4  ers-co2-9da0     ERS CO2  2022-05-04T08:25:14+00:00         20.8        46

   light  motion  visit   vdd                  lat_long    co2
0  441.0       0      0  3601  -37.8133183, 144.9668853  507.0
1  505.0       0      0  3599                       NaN  525.0
2  528.0       0      0  3601                       NaN  527.0
3  441.0       0      0  3596                       NaN  535.0
4   90.0       0      0  3596  -37.8133183, 144.9668853  495.0
# microclimate-sensor-locations
print(datasets['microclimate-sensor-locations'].head())
   site_id gatewayhub_id site_status  last_data_local start_reading  \
0     1004       arc1048           R              NaN    2019-11-15
1     1006       arc1112           R              NaN    2021-05-20
2     1011       arc1055           C              NaN    2021-07-07
3     1008       arc1045           R              NaN    2021-06-14
4     1016       arc1049           C              NaN    2021-09-17

  end_reading   latitude   longitude                     location
0  2021-06-13 -37.800575  144.964635      -37.800575, 144.964635
1  2021-06-28 -37.822486  144.952065      -37.822486, 144.952065
2         NaN -37.822222  144.952222      -37.822222, 144.952222
3  2021-06-20 -37.817465  144.967057  -37.81746522, 144.96705703
4         NaN -37.812778  144.960556      -37.812778, 144.960556
# View number of records
for dataset_id, df in datasets.items():
    if df is not None:
        num_records = df.shape[0]  # number of rows
        print(f"{dataset_id} has {num_records} records.")
microclimate-sensor-locations has 16 records.
Save datasets locally or on drive¶
- Save each dataset as a .csv file in the local file system (Google Colab):
base_path = '/content/drive/My Drive/sit378_eco_metrix/'
for dataset_id, df in datasets.items():
    if df is not None:
        filename = f"{base_path}{dataset_id}.csv"
        df.to_csv(filename, index=False)
        print(f"Saved {filename} to Google Drive.")
Saved /content/drive/My Drive/sit378_eco_metrix/microclimate-sensor-locations.csv to Google Drive.
B. Load datasets (if not downloaded)¶
- Load each dataset from its .csv file in the local file system (Google Colab)
base_path = '/content/drive/My Drive/sit378_eco_metrix/'
datasets = {}  # Dictionary

# Filenames
dataset_filenames = {
    'soil-sensor-readings-historical-data': 'soil-sensor-readings-historical-data.csv',
    'soil-sensor-locations': 'soil-sensor-locations.csv',
    'microclimate-sensor-readings': 'microclimate-sensor-readings.csv',
    'microlabs-co2-sensor-data': 'microlabs-co2-sensor-data.csv',
    'microclimate-sensor-locations': 'microclimate-sensor-locations.csv'
}

# Load each dataset into the datasets dictionary
for dataset_id, filename in dataset_filenames.items():
    full_path = f"{base_path}{filename}"
    datasets[dataset_id] = pd.read_csv(full_path)
    print(f"Loaded {dataset_id} with {datasets[dataset_id].shape[0]} records.")
Loaded soil-sensor-readings-historical-data with 2400683 records.
Loaded soil-sensor-locations with 90 records.
Loaded microclimate-sensor-readings with 56 records.
Loaded microlabs-co2-sensor-data with 207523 records.
Loaded microclimate-sensor-locations with 16 records.
Part 1 Data Preprocessing:¶
- Convert time columns to datetime
- Combine datasets on a shared key ('site_id'; check merge)
- Clean (normalise) the data.
# Convert 'local_time' and 'time' to datetime
datasets['microclimate-sensor-readings']['local_time'] = pd.to_datetime(datasets['microclimate-sensor-readings']['local_time'])
datasets['soil-sensor-readings-historical-data']['local_time'] = pd.to_datetime(datasets['soil-sensor-readings-historical-data']['local_time'])
datasets['microlabs-co2-sensor-data']['time'] = pd.to_datetime(datasets['microlabs-co2-sensor-data']['time'])
datasets['microclimate-sensor-locations']['start_reading'] = pd.to_datetime(datasets['microclimate-sensor-locations']['start_reading'])
# 'site_id' Check merge
print(datasets['soil-sensor-readings-historical-data']['site_id'].nunique()) # Total records 2400683
print(datasets['soil-sensor-locations']['site_id'].nunique()) # Total records 90
print(datasets['microclimate-sensor-readings']['site_id'].nunique()) # Total records 56
print(datasets['microclimate-sensor-locations']['site_id'].nunique()) # Total records 16
73
90
3
16
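Before merging, the overlap between `site_id` keys can be checked explicitly with a set intersection or `merge(indicator=True)`. A minimal sketch on toy frames (the `site_id` values and columns here are illustrative, not the real sensor data):

```python
import pandas as pd

# Toy frames standing in for the readings and locations datasets
readings = pd.DataFrame({'site_id': [1, 2, 2, 3], 'soil_value': [10.0, 12.5, 13.1, 9.8]})
locations = pd.DataFrame({'site_id': [1, 2, 4], 'latitude': [-37.80, -37.81, -37.82]})

# Which site_ids appear in both? (an inner merge keeps only these)
shared = set(readings['site_id']) & set(locations['site_id'])
print(f"Shared site_ids: {sorted(shared)}")

# merge(indicator=True) labels each row's provenance: 'both', 'left_only', 'right_only'
check = readings.merge(locations, on='site_id', how='outer', indicator=True)
print(check['_merge'].value_counts())
```

Rows flagged `left_only` are readings whose site has no known location; they would be silently dropped by the inner merge below.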
# Merge operation
soil_sensor_data = pd.merge(datasets['soil-sensor-readings-historical-data'],
                            datasets['soil-sensor-locations'],
                            on='site_id',
                            how='inner')
print(soil_sensor_data.info())
# dataframes
microlabs_co2_sensor_data = datasets['microlabs-co2-sensor-data']
microclimate_sensor_readings = datasets['microclimate-sensor-readings']
microclimate_sensor_locations = datasets['microclimate-sensor-locations']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2400683 entries, 0 to 2400682
Data columns (total 14 columns):
 #   Column            Dtype
---  ------            -----
 0   local_time        datetime64[ns, UTC]
 1   site_name_x       object
 2   site_id           int64
 3   id                int64
 4   probe_id          int64
 5   probe_measure     object
 6   soil_value        float64
 7   unit              object
 8   json_featuretype  object
 9   site_name_y       object
 10  property_name     object
 11  latitude          float64
 12  longitude         float64
 13  location          object
dtypes: datetime64[ns, UTC](1), float64(3), int64(3), object(7)
memory usage: 256.4+ MB
None
1. Data Preprocessing:¶
- Address missing values
- Check data types
- Locate and fix dataset-specific issues
- Combine datasets for analysis (Step 2)
Logic¶
When Combining Datasets:¶
- If datasets are similar, clean before combining (fill NaN values, standardise units, convert data types).
- If the datasets are related, you need to understand these relationships to properly clean the data.
However, these datasets have different structures¶
- Cleaning beforehand helps ensure that we only combine relevant, clean data, making subsequent analysis more straightforward.
- Some datasets might have unique columns that need specific cleaning steps which are not applicable to others.
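The "clean (normalise)" step mentioned above can be sketched with `MinMaxScaler` (imported in Set Up); the toy frame and column values below are illustrative, not the real sensor data:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame with two numeric sensor columns (illustrative values)
df = pd.DataFrame({'soil_value': [10.0, 25.0, 40.0], 'co2': [400.0, 500.0, 600.0]})

scaler = MinMaxScaler()  # rescales each column independently to the [0, 1] range
df[['soil_value', 'co2']] = scaler.fit_transform(df[['soil_value', 'co2']])
print(df)
```

Normalising puts columns with different units (e.g. %VWC vs ppm) on a common scale, which matters for distance-based methods and many regression models.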
# Check missing values for dataset
mval_microlabs_locations = microclimate_sensor_locations.isnull().sum()
print(f"mval_microlabs_locations: {mval_microlabs_locations}")
print(mval_microlabs_locations.dtypes)
mval_microlabs_locations: site_id             0
gatewayhub_id       0
site_status         0
last_data_local    16
start_reading       0
end_reading         9
latitude            0
longitude           0
location            0
dtype: int64
int64
# Convert 'start_reading' column to datetime
microclimate_sensor_locations['start_reading'] = pd.to_datetime(microclimate_sensor_locations['start_reading'])
# Drop sparsely populated columns
microclimate_sensor_locations = microclimate_sensor_locations.drop(columns=['last_data_local', 'end_reading'])
microclimate_sensor_locations.head()
| site_id | gatewayhub_id | site_status | start_reading | latitude | longitude | location | |
|---|---|---|---|---|---|---|---|
| 0 | 1004 | arc1048 | R | 2019-11-15 | -37.800575 | 144.964635 | -37.800575, 144.964635 |
| 1 | 1006 | arc1112 | R | 2021-05-20 | -37.822486 | 144.952065 | -37.822486, 144.952065 |
| 2 | 1011 | arc1055 | C | 2021-07-07 | -37.822222 | 144.952222 | -37.822222, 144.952222 |
| 3 | 1008 | arc1045 | R | 2021-06-14 | -37.817465 | 144.967057 | -37.81746522, 144.96705703 |
| 4 | 1016 | arc1049 | C | 2021-09-17 | -37.812778 | 144.960556 | -37.812778, 144.960556 |
Check missing values microlabs_co2_sensor_data¶
- sensor_name (8) - object - sensor_name is constant, copied to missing cells
- light (3) - float64 - mean of all days at a similar time, copied to missing cells
- lat_long (8) - object - lat_long is constant, copied to missing cells
- co2 (2) - float64 - dropped records
microlabs_co2_sensor_data['latitude'] = '-37.8133183'
microlabs_co2_sensor_data['longitude'] = '144.9668853'
microlabs_co2_sensor_data['sensor_name'] = 'ERS CO2'
microlabs_co2_sensor_data['date'] = microlabs_co2_sensor_data['time'].dt.date
# Filter records where 'light' column has missing/null values
null_light_records = microlabs_co2_sensor_data[microlabs_co2_sensor_data['light'].isnull()]
df = microlabs_co2_sensor_data # Check records near missing data
missing_indices = df[df['light'].isnull()].index # Find indices where 'light' is null
first_missing_index = missing_indices[0]
context_range = 5 # How many rows before and after the missing value to display
df.loc[first_missing_index - context_range:first_missing_index + context_range]
# No exact records found Compared to other days
# 0.0 = 2023-07-21 17:10:45+00:00, 2023-06-29 21:22:18+00:00
target_time = pd.Timestamp('2023-07-16 02:26:09+00:00')
days_range = 2
different_days_records = microlabs_co2_sensor_data[(microlabs_co2_sensor_data['time'] >= target_time - pd.Timedelta(days=days_range)) &
(microlabs_co2_sensor_data['time'] <= target_time + pd.Timedelta(days=days_range))]
print(different_days_records)
dev_id sensor_name time temperature \
123307 ers-co2-9da0 ERS CO2 2023-07-14 06:46:17+00:00 22.3
123308 ers-co2-9da0 ERS CO2 2023-07-14 07:06:17+00:00 21.5
123309 ers-co2-9da0 ERS CO2 2023-07-14 05:26:17+00:00 24.5
123310 ers-co2-9da0 ERS CO2 2023-07-14 06:06:17+00:00 24.5
123311 ers-co2-9da0 ERS CO2 2023-07-14 06:51:17+00:00 22.2
... ... ... ... ...
173124 ers-co2-9da0 ERS CO2 2023-07-18 01:16:01+00:00 18.0
173125 ers-co2-9da0 ERS CO2 2023-07-18 01:21:01+00:00 18.7
173126 ers-co2-9da0 ERS CO2 2023-07-18 01:46:01+00:00 20.9
173127 ers-co2-9da0 ERS CO2 2023-07-18 02:21:01+00:00 22.6
173128 ers-co2-9da0 ERS CO2 2023-07-18 02:16:01+00:00 22.4
humidity light motion visit vdd lat_long co2 \
123307 37 80.0 0 0 3654 -37.8133183, 144.9668853 429.0
123308 38 30.0 0 0 3654 -37.8133183, 144.9668853 419.0
123309 32 336.0 0 0 3651 -37.8133183, 144.9668853 419.0
123310 33 138.0 0 0 3654 -37.8133183, 144.9668853 429.0
123311 37 67.0 0 0 3654 -37.8133183, 144.9668853 422.0
... ... ... ... ... ... ... ...
173124 47 229.0 0 0 3640 -37.8133183, 144.9668853 414.0
173125 45 390.0 0 0 3640 -37.8133183, 144.9668853 426.0
173126 41 771.0 0 0 3643 -37.8133183, 144.9668853 421.0
173127 37 333.0 0 0 3646 -37.8133183, 144.9668853 420.0
173128 38 323.0 0 0 3646 -37.8133183, 144.9668853 426.0
latitude longitude date
123307 -37.8133183 144.9668853 2023-07-14
123308 -37.8133183 144.9668853 2023-07-14
123309 -37.8133183 144.9668853 2023-07-14
123310 -37.8133183 144.9668853 2023-07-14
123311 -37.8133183 144.9668853 2023-07-14
... ... ... ...
173124 -37.8133183 144.9668853 2023-07-18
173125 -37.8133183 144.9668853 2023-07-18
173126 -37.8133183 144.9668853 2023-07-18
173127 -37.8133183 144.9668853 2023-07-18
173128 -37.8133183 144.9668853 2023-07-18
[1137 rows x 14 columns]
# Mean of 'light' values at the same time-of-day across all days
# (the >= / <= bounds in the original filters were identical, i.e. an exact time match)
for t in ['02:26:09', '17:10:45', '21:22:18']:
    target_t = pd.to_datetime(t).time()
    records_around_time = microlabs_co2_sensor_data[microlabs_co2_sensor_data['time'].dt.time == target_t]
    mean_light_around_time = records_around_time['light'].mean()
    print(f"Mean of 'light' values for all days around {t}+00:00: {mean_light_around_time}\n")
Mean of 'light' values for all days around 02:26:09+00:00: 234.0

Mean of 'light' values for all days around 17:10:45+00:00: 0.0

Mean of 'light' values for all days around 21:22:18+00:00: 94.0
# Update missing values in 'light'
replacement_values = {
    '2023-06-29 21:22:18+00:00': 94.0,
    '2023-07-21 17:10:45+00:00': 0.0,
    '2023-07-16 02:26:09+00:00': 234.0
}
# Replace
for ts, value in replacement_values.items():
    microlabs_co2_sensor_data.loc[(microlabs_co2_sensor_data['time'] == ts) & (microlabs_co2_sensor_data['light'].isnull()), 'light'] = value

missing_light_records = microlabs_co2_sensor_data[microlabs_co2_sensor_data['light'].isnull()]
if missing_light_records.empty:
    print("All missing 'light' records have been replaced with the specified values.")
else:
    print("Some 'light' records are still missing.")
All missing 'light' records have been replaced with the specified values.
microlabs_co2_sensor_data.dropna(subset=['co2'], inplace=True)  # Drop records with NaN 'co2'
# Confirm no NaN 'co2' values remain
missing_co2_records_after_drop = microlabs_co2_sensor_data[microlabs_co2_sensor_data['co2'].isnull()]
microlabs_co2_sensor_data.reset_index(drop=True, inplace=True)  # Reset index
mval_microlabs = microlabs_co2_sensor_data.isnull().sum()
print(f"microlabs_co2_sensor_data: {mval_microlabs}")
microlabs_co2_sensor_data: dev_id          0
sensor_name     0
time            0
temperature     0
humidity        0
light           0
motion          0
visit           0
vdd             0
lat_long        8
co2             0
latitude        0
longitude       0
date            0
dtype: int64
Check Missing Values soil_sensor_data:¶
- local_time (928) - datetime64[ns, UTC] -
Last Observation Carried Forward:¶
LOCF is a simple but elegant hack where the previous non-missing values are carried or copied forward and replaced with the missing values. [3]¶
df['Forward_Fill'] = df['AvgTemperature'].ffill()
- soil_value (404) - float64 -
- property_name (1216248) - object -
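A minimal, self-contained demonstration of LOCF via `ffill()` on a toy series (the values are illustrative, not the actual sensor data):

```python
import numpy as np
import pandas as pd

# Toy temperature series with gaps; each NaN takes the last observed value before it
s = pd.Series([21.5, np.nan, np.nan, 22.0, np.nan])
filled = s.ffill()
print(filled.tolist())  # [21.5, 21.5, 21.5, 22.0, 22.0]
```

Note that LOCF assumes the value was stable across the gap, so it is best suited to short gaps in slowly changing signals.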
# Re Check Data
df_soil_sensor_data = soil_sensor_data.copy()
df_soil_sensor_data['local_time'] = df_soil_sensor_data['local_time'].ffill()
# Check NaN values
total_nat_count_original = soil_sensor_data['local_time'].isna().sum()
total_nat_count = df_soil_sensor_data['local_time'].isna().sum()
print(f"Total number of NaN: {total_nat_count_original}")
print(f"Total number of NaN: {total_nat_count}")
Total number of NaN: 928
Total number of NaN: 0
# Check 'local_time' difference since LOCF method applied Last Observation Carried Forward:
# Set timezone
soil_sensor_data['local_time'] = pd.to_datetime(soil_sensor_data['local_time']).dt.tz_localize(None)
df_soil_sensor_data['local_time'] = pd.to_datetime(df_soil_sensor_data['local_time']).dt.tz_localize(None)
fig, axs = plt.subplots(1, 2, figsize=(15, 6))
# Original
axs[0].hist(soil_sensor_data['local_time'], bins=100, color=color_d[2])
axs[0].set_title('Original Data')
axs[0].set_xlabel('Local Time')
axs[0].set_ylabel('Frequency')
# New
axs[1].hist(df_soil_sensor_data['local_time'], bins=100, color=color_d[3])
axs[1].set_title('After LOCF')
axs[1].set_xlabel('Local Time')
axs[1].set_ylabel('Frequency')
plt.tight_layout()
plt.show()
# look at records before and after the (Original)
soil_sensor_data = soil_sensor_data.sort_values(by='local_time') # Sort
soil_sensor_data['time_diff'] = soil_sensor_data['local_time'].diff()
gap_threshold = pd.Timedelta(days=7)
gaps = soil_sensor_data[soil_sensor_data['time_diff'] > gap_threshold]
print(gaps[['local_time', 'time_diff']])
gap_info = soil_sensor_data[soil_sensor_data['time_diff'] == pd.Timedelta('158 days 21:30:00')]
if not gap_info.empty:
    for index, row in gap_info.iterrows():
        end_of_gap = row['local_time']
        start_of_gap = end_of_gap - row['time_diff']
        print(f"Gap starts at {start_of_gap} and ends at {end_of_gap}, lasting {row['time_diff']}")
                local_time          time_diff
609666 2023-09-02 10:30:00  158 days 21:30:00
Gap starts at 2023-03-27 13:00:00 and ends at 2023-09-02 10:30:00, lasting 158 days 21:30:00
# look at records before and after the gap LOCF
df_soil_sensor_data = df_soil_sensor_data.sort_values(by='local_time') # Sort
df_soil_sensor_data['time_diff'] = df_soil_sensor_data['local_time'].diff()
df_gap_threshold = pd.Timedelta(days=7)
df_gaps = df_soil_sensor_data[df_soil_sensor_data['time_diff'] > df_gap_threshold]
print(df_gaps[['local_time', 'time_diff']])
df_gap_info = df_soil_sensor_data[df_soil_sensor_data['time_diff'] == pd.Timedelta('158 days 21:30:00')]
if not df_gap_info.empty:
    for index, row in df_gap_info.iterrows():
        df_end_of_gap = row['local_time']
        df_start_of_gap = df_end_of_gap - row['time_diff']
        print(f"Gap starts at {df_start_of_gap} and ends at {df_end_of_gap}, lasting {row['time_diff']}")
                local_time          time_diff
609632 2023-09-02 10:30:00  158 days 21:30:00
Gap starts at 2023-03-27 13:00:00 and ends at 2023-09-02 10:30:00, lasting 158 days 21:30:00
# Check soil value missing data
df = df_soil_sensor_data
missing_indices = df[df['soil_value'].isnull()].index
first_missing_index = missing_indices[0]
context_range = 5
df.loc[first_missing_index - context_range:first_missing_index + context_range]
# df_soil_sensor_data['soil_value'] where ['site_name_y'] == 'Block 3'
block_3_soil_values = df_soil_sensor_data[df_soil_sensor_data['site_name_y'] == 'Block 3']
print(block_3_soil_values['soil_value'])
315293 37.35
315225 27.22
306775 18.16
315274 21.96
306212 17.56
...
333222 18.44
333210 17.50
333303 17.50
333320 17.90
333221 18.45
Name: soil_value, Length: 67623, dtype: float64
# Find missing soil values based on each site name
grouped_data = df_soil_sensor_data.groupby('site_name_y')
for name, group in grouped_data:
    # Rows where soil_value is NaN
    nan_rows = group[group['soil_value'].isna()]
    if not nan_rows.empty:
        print(f"NaN 'soil_value' for site '{name}':")
        print(nan_rows[['site_name_y', 'soil_value', 'local_time']])
        print()
NaN 'soil_value' for site 'Block 1':
site_name_y soil_value local_time
359244 Block 1 NaN 2023-12-17 19:15:00
359191 Block 1 NaN 2023-12-17 19:15:00
359243 Block 1 NaN 2023-12-17 19:30:00
352619 Block 1 NaN 2023-12-17 19:30:00
352617 Block 1 NaN 2023-12-17 19:45:00
... ... ... ...
369198 Block 1 NaN 2024-02-26 20:45:00
367267 Block 1 NaN 2024-02-26 21:00:00
367278 Block 1 NaN 2024-02-26 21:00:00
367276 Block 1 NaN 2024-02-26 21:15:00
367269 Block 1 NaN 2024-02-26 21:15:00
[80 rows x 3 columns]
NaN 'soil_value' for site 'Block 2':
site_name_y soil_value local_time
121989 Block 2 NaN 2023-12-17 20:15:00
110096 Block 2 NaN 2023-12-17 20:15:00
73292 Block 2 NaN 2023-12-17 20:15:00
121987 Block 2 NaN 2023-12-17 20:30:00
121994 Block 2 NaN 2023-12-17 20:30:00
121991 Block 2 NaN 2023-12-17 20:30:00
121986 Block 2 NaN 2023-12-17 20:45:00
110098 Block 2 NaN 2023-12-17 20:45:00
73293 Block 2 NaN 2023-12-17 20:45:00
121992 Block 2 NaN 2023-12-17 21:00:00
110097 Block 2 NaN 2023-12-17 21:00:00
110095 Block 2 NaN 2023-12-17 21:00:00
120337 Block 2 NaN 2024-01-15 20:15:00
132428 Block 2 NaN 2024-01-15 20:15:00
132419 Block 2 NaN 2024-01-15 20:15:00
132348 Block 2 NaN 2024-01-15 20:30:00
120336 Block 2 NaN 2024-01-15 20:30:00
132356 Block 2 NaN 2024-01-15 20:30:00
120338 Block 2 NaN 2024-01-15 20:45:00
132349 Block 2 NaN 2024-01-15 20:45:00
132358 Block 2 NaN 2024-01-15 20:45:00
120329 Block 2 NaN 2024-01-15 21:00:00
132357 Block 2 NaN 2024-01-15 21:00:00
132359 Block 2 NaN 2024-01-15 21:00:00
133343 Block 2 NaN 2024-01-23 20:15:00
135920 Block 2 NaN 2024-01-23 20:15:00
135919 Block 2 NaN 2024-01-23 20:15:00
133336 Block 2 NaN 2024-01-23 20:30:00
134387 Block 2 NaN 2024-01-23 20:30:00
133344 Block 2 NaN 2024-01-23 20:30:00
133337 Block 2 NaN 2024-01-23 20:45:00
133342 Block 2 NaN 2024-01-23 20:45:00
135921 Block 2 NaN 2024-01-23 20:45:00
135929 Block 2 NaN 2024-01-23 21:00:00
135922 Block 2 NaN 2024-01-23 21:00:00
133338 Block 2 NaN 2024-01-23 21:00:00
137202 Block 2 NaN 2024-02-25 20:15:00
137174 Block 2 NaN 2024-02-25 20:15:00
137165 Block 2 NaN 2024-02-25 20:15:00
135526 Block 2 NaN 2024-02-25 20:30:00
138206 Block 2 NaN 2024-02-25 20:30:00
137200 Block 2 NaN 2024-02-25 20:30:00
137167 Block 2 NaN 2024-02-25 20:45:00
135533 Block 2 NaN 2024-02-25 20:45:00
138205 Block 2 NaN 2024-02-25 20:45:00
135534 Block 2 NaN 2024-02-25 21:00:00
137166 Block 2 NaN 2024-02-25 21:00:00
137201 Block 2 NaN 2024-02-25 21:00:00
135611 Block 2 NaN 2024-02-26 20:15:00
138406 Block 2 NaN 2024-02-26 20:15:00
138414 Block 2 NaN 2024-02-26 20:15:00
137246 Block 2 NaN 2024-02-26 20:30:00
138407 Block 2 NaN 2024-02-26 20:30:00
135614 Block 2 NaN 2024-02-26 20:30:00
135612 Block 2 NaN 2024-02-26 20:45:00
135613 Block 2 NaN 2024-02-26 20:45:00
135609 Block 2 NaN 2024-02-26 20:45:00
135615 Block 2 NaN 2024-02-26 21:00:00
137247 Block 2 NaN 2024-02-26 21:00:00
135608 Block 2 NaN 2024-02-26 21:00:00
NaN 'soil_value' for site 'Block 3':
site_name_y soil_value local_time
301504 Block 3 NaN 2023-12-17 19:15:00
312897 Block 3 NaN 2023-12-17 19:15:00
301509 Block 3 NaN 2023-12-17 19:15:00
265994 Block 3 NaN 2023-12-17 19:30:00
266005 Block 3 NaN 2023-12-17 19:30:00
... ... ... ...
326623 Block 3 NaN 2024-02-26 21:00:00
328316 Block 3 NaN 2024-02-26 21:00:00
329506 Block 3 NaN 2024-02-26 21:15:00
328322 Block 3 NaN 2024-02-26 21:15:00
326627 Block 3 NaN 2024-02-26 21:15:00
[120 rows x 3 columns]
NaN 'soil_value' for site 'Princes Park Lawn 5':
site_name_y soil_value local_time
58981 Princes Park Lawn 5 NaN 2024-01-15 21:00:00
58982 Princes Park Lawn 5 NaN 2024-01-15 21:00:00
58979 Princes Park Lawn 5 NaN 2024-01-15 21:00:00
58980 Princes Park Lawn 5 NaN 2024-01-15 21:00:00
58888 Princes Park Lawn 5 NaN 2024-01-15 21:00:00
... ... ... ...
63607 Princes Park Lawn 5 NaN 2024-02-26 21:00:00
63585 Princes Park Lawn 5 NaN 2024-02-26 21:00:00
67721 Princes Park Lawn 5 NaN 2024-02-26 21:00:00
67718 Princes Park Lawn 5 NaN 2024-02-26 21:00:00
67716 Princes Park Lawn 5 NaN 2024-02-26 21:00:00
[128 rows x 3 columns]
NaN 'soil_value' for site 'Walker Field':
site_name_y soil_value local_time
2089749 Walker Field NaN 2024-01-23 21:00:00
2089750 Walker Field NaN 2024-01-23 21:00:00
2089751 Walker Field NaN 2024-01-23 21:00:00
2089771 Walker Field NaN 2024-01-23 21:00:00
2089268 Walker Field NaN 2024-01-23 21:00:00
2089772 Walker Field NaN 2024-01-23 21:00:00
2089782 Walker Field NaN 2024-01-23 21:00:00
2090528 Walker Field NaN 2024-01-23 21:00:00
2090529 Walker Field NaN 2024-01-23 21:00:00
2089761 Walker Field NaN 2024-01-23 21:00:00
2089267 Walker Field NaN 2024-01-23 21:00:00
2089240 Walker Field NaN 2024-01-23 21:00:00
2089253 Walker Field NaN 2024-01-23 21:00:00
2090539 Walker Field NaN 2024-01-23 21:00:00
2089254 Walker Field NaN 2024-01-23 21:00:00
2090540 Walker Field NaN 2024-01-23 21:00:00
# Find soil values on other days for Block 1
block_1_data = df_soil_sensor_data[df_soil_sensor_data['site_name_y'] == 'Block 1']
date_of_interest = pd.to_datetime('2023-12-17 19:15:00')
date_range_start = date_of_interest - pd.Timedelta(days=2)
date_range_end = date_of_interest + pd.Timedelta(days=2)
surrounding_soil_values = block_1_data[(block_1_data['local_time'] >= date_range_start) &
(block_1_data['local_time'] <= date_range_end) &
(block_1_data['local_time'] != date_of_interest)]
print(surrounding_soil_values[['local_time', 'soil_value']])
unique_rows = surrounding_soil_values[['local_time', 'soil_value']].drop_duplicates()
print(unique_rows)
local_time soil_value
333479 2023-12-15 19:15:00 22.3
358468 2023-12-15 19:15:00 -9999.0
333478 2023-12-15 19:30:00 22.4
333486 2023-12-15 19:30:00 -9999.0
353417 2023-12-15 19:45:00 22.5
... ... ...
361948 2023-12-19 18:45:00 21.0
353684 2023-12-19 19:00:00 -9999.0
353676 2023-12-19 19:00:00 21.1
362008 2023-12-19 19:15:00 -9999.0
353678 2023-12-19 19:15:00 21.2
[352 rows x 2 columns]
local_time soil_value
333479 2023-12-15 19:15:00 22.3
358468 2023-12-15 19:15:00 -9999.0
333478 2023-12-15 19:30:00 22.4
333486 2023-12-15 19:30:00 -9999.0
353417 2023-12-15 19:45:00 22.5
... ... ...
361948 2023-12-19 18:45:00 21.0
353684 2023-12-19 19:00:00 -9999.0
353676 2023-12-19 19:00:00 21.1
362008 2023-12-19 19:15:00 -9999.0
353678 2023-12-19 19:15:00 21.2
[345 rows x 2 columns]
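Most of the NaN gaps above are short runs inside an otherwise dense 15-minute series, and the surrounding days carry valid readings. One option (not applied in this notebook) is time-based interpolation within each site. A minimal sketch on invented data mimicking Block 1's cadence:

```python
import numpy as np
import pandas as pd

# Toy 15-minute series for one site with a two-reading gap (invented values)
times = pd.date_range('2023-12-17 19:00', periods=5, freq='15min')
s = pd.DataFrame({'local_time': times,
                  'soil_value': [22.0, np.nan, np.nan, 22.6, 22.8]})

# Interpolate linearly in time across the gap (requires a DatetimeIndex)
s = s.set_index('local_time')
s['soil_value'] = s['soil_value'].interpolate(method='time')
print(s['soil_value'].tolist())  # gap filled with values between 22.0 and 22.6
```

In the real data this would be done per site (e.g. inside a `groupby`), and only for gaps short enough that linear interpolation is defensible.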
Find Outliers soil_sensor_data¶
- Use the interquartile range (IQR) to identify values that are significantly higher or lower than the rest of the data.
- Outliers = values that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where Q1 and Q3 are the 25th and 75th percentiles of the dataset.
# IQR on the deduplicated Block 1 sample (unique_rows)
Q1 = unique_rows['soil_value'].quantile(0.25)
Q3 = unique_rows['soil_value'].quantile(0.75)
IQR = Q3 - Q1
# Thresholds & outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = unique_rows[(unique_rows['soil_value'] < lower_bound) | (unique_rows['soil_value'] > upper_bound)]
print(f"Outliers in the dataset: {outliers}")
# IQR over the full soil dataset
Q1 = df_soil_sensor_data['soil_value'].quantile(0.25)
Q3 = df_soil_sensor_data['soil_value'].quantile(0.75)
IQR = Q3 - Q1
# Thresholds & outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df_soil_sensor_data[(df_soil_sensor_data['soil_value'] < lower_bound) | (df_soil_sensor_data['soil_value'] > upper_bound)]
print(f"Outliers in the dataset: {outliers}")
outliers.count()
Outliers in the dataset: Empty DataFrame
Columns: [local_time, soil_value]
Index: []
Outliers in the dataset: local_time site_name_x site_id id \
2184734 2022-12-31 13:00:00 Cnr Barry & Leicester 69159 15517638
2258430 2022-12-31 13:00:00 Point Park 65002 15513658
1127023 2022-12-31 13:00:00 Bandstand 75507 15522903
864604 2022-12-31 13:00:00 Kings Domain North 66203 15516538
1811593 2022-12-31 13:00:00 Flagstaff Gardens 3 64991 15520045
... ... ... ... ...
371701 2024-04-01 21:00:00 Block 1 193881 21965543
142083 2024-04-01 21:00:00 Block 2 193854 21965503
72853 2024-04-01 21:00:00 Princes Park Lawn 5 64993 21965571
72941 2024-04-01 21:00:00 Princes Park Lawn 5 64993 21965620
371700 2024-04-01 21:15:00 Block 1 193881 21965464
probe_id probe_measure soil_value unit json_featuretype \
2184734 1392351 Soil Moisture 80cm #0 74.10 %VWC SoilSensorData
2258430 1304995 Soil Moisture 40cm #0 75.26 %VWC SoilSensorData
1127023 1619004 Soil Moisture 80cm #0 65.70 %VWC SoilSensorData
864604 1338328 Soil Moisture 60cm #0 62.58 %VWC SoilSensorData
1811593 1306544 Soil Moisture 70cm #0 75.11 %VWC SoilSensorData
... ... ... ... ... ...
371701 2442600 #4 Temperature -9999.00 ºC Output
142083 2442548 #7 Temperature -9999.00 ºC Output
72853 1303868 Soil Moisture 30cm #0 69.10 %VWC Output
72941 1303869 Soil Moisture 40cm #0 68.41 %VWC Output
371700 2442600 #4 Temperature -9999.00 ºC Output
site_name_y property_name latitude longitude \
2184734 Cnr Barry & Leicester NaN -37.803899 144.960135
2258430 Point Park Point Park -37.823340 144.942270
1127023 Bandstand Fitzroy Gardens -37.811944 144.981126
864604 Kings Domain North Kings Domain -37.822760 144.972350
1811593 Flagstaff Gardens 3 Flagstaff Gardens -37.811880 144.954590
... ... ... ... ...
371701 Block 1 NaN -37.823867 144.925485
142083 Block 2 NaN -37.823877 144.925702
72853 Princes Park Lawn 5 Princes Park -37.780690 144.961300
72941 Princes Park Lawn 5 Princes Park -37.780690 144.961300
371700 Block 1 NaN -37.823867 144.925485
location time_diff
2184734 -37.80389870978553, 144.96013534784527 0 days
2258430 -37.82333999998352, 144.94226999997014 0 days
1127023 -37.811944247241186, 144.98112587268574 0 days
864604 -37.82275999998565, 144.97234999998997 0 days
1811593 -37.81187999998543, 144.9545899999588 0 days
... ... ...
371701 -37.8238673088212, 144.92548478778536 0 days
142083 -37.82387742353656, 144.92570171287727 0 days
72853 -37.78068999999, 144.96129999999133 0 days
72941 -37.78068999999, 144.96129999999133 0 days
371700 -37.8238673088212, 144.92548478778536 0 days
[74411 rows x 15 columns]
local_time          74411
site_name_x         74411
site_id             74411
id                  74411
probe_id            74411
probe_measure       74411
soil_value          74411
unit                74411
json_featuretype    74411
site_name_y         74411
property_name       24307
latitude            74411
longitude           74411
location            74411
time_diff           74411
dtype: int64
count_soil_values = df_soil_sensor_data['soil_value'].count()
print("Number of soil values:", count_soil_values)
Number of soil values: 2400279
Summary: NaN records and IQR outliers in soil_value¶
- Dataset: 2,400,683 records
- 404 NaN records in ['soil_value']
- 74,411 outliers in ['soil_value'] using IQR
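Note that many of the rows flagged by the IQR pass carry the sentinel value -9999.0 (visible in the Block 1 surrounding-values output above), which is a missing-data marker rather than a real reading. Converting the sentinel to NaN before computing quartiles stops it from dragging the bounds down. A hedged sketch on invented values:

```python
import numpy as np
import pandas as pd

# Toy soil_value column containing the -9999.0 sentinel seen in the raw data
df = pd.DataFrame({'soil_value': [22.3, 22.4, -9999.0, 22.5, 21.0, -9999.0, 21.2]})

# Treat the sentinel as missing so it cannot drag Q1 (and the fences) down
df['soil_value'] = df['soil_value'].replace(-9999.0, np.nan)

# Standard IQR fence on the cleaned column (quantile ignores NaN)
Q1 = df['soil_value'].quantile(0.25)
Q3 = df['soil_value'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['soil_value'] < Q1 - 1.5 * IQR) |
              (df['soil_value'] > Q3 + 1.5 * IQR)]
print(len(outliers))  # 0: the sentinel rows are now NaN, not "outliers"
```

On the real column the 74,411 flagged rows would then split into genuine extremes and sentinel rows, which deserve different treatment.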
# Drop rows where soil_value is missing
missing_indices = df_soil_sensor_data[df_soil_sensor_data['soil_value'].isnull()].index
df_soil_sensor_data.drop(index=missing_indices, inplace=True)
print("Missing indices:", missing_indices)
Missing indices: Index([301504, 312897, 359244, 301509, 359191, 265994, 266005, 359243, 352619,
301505,
...
63607, 63585, 67721, 67718, 67716, 329506, 328322, 367276, 326627,
367269],
dtype='int64', length=404)
Missing values soil_sensor_data['property_name']¶
1,215,988 of the 2,400,279 remaining records have no property_name, leaving 1,184,291 populated records.
df_soil_sensor_data['property_name'].head()
1876382               Royal Park
464168                       NaN
2205917    Carlton Gardens North
197768                       NaN
898892         Alexandra Gardens
Name: property_name, dtype: object
# Drop rows where property_name is missing
missing_indices = df_soil_sensor_data[df_soil_sensor_data['property_name'].isnull()].index
df_soil_sensor_data.drop(index=missing_indices, inplace=True)
print("Missing indices:", missing_indices)
Missing indices: Index([ 464168, 197768, 197782, 1105493, 793984, 1105510, 1105511, 1105512,
1105513, 1105515,
...
142083, 333222, 333210, 142073, 142082, 371696, 371700, 333303,
333320, 333221],
dtype='int64', length=1215988)
property_name_values = df_soil_sensor_data['property_name'].count()
print("Number of property_name:", property_name_values)
# Drop the property_name column and any rows with missing time_diff
df_soil_sensor_data = df_soil_sensor_data.drop(columns=['property_name'])
df_soil_sensor_data = df_soil_sensor_data.dropna(subset=['time_diff'])
Number of property_name: 1184291
# Check missing values
mval_soil_sensor = df_soil_sensor_data.isnull().sum()
print(f"Soil sensor data missing values:\n{mval_soil_sensor}")
Soil sensor data missing values:
local_time          0
site_name_x         0
site_id             0
id                  0
probe_id            0
probe_measure       0
soil_value          0
unit                0
json_featuretype    0
site_name_y         0
latitude            0
longitude           0
location            0
time_diff           0
dtype: int64
Re-check all datasets¶
# Check missing values for each dataset
mval_microlabs = microlabs_co2_sensor_data.isnull().sum()
mval_microclimate = microclimate_sensor_readings.isnull().sum()
mval_soil_sensor = df_soil_sensor_data.isnull().sum()
mval_microlabs_locations = microclimate_sensor_locations.isnull().sum()
print(f"microlabs_co2_sensor_data:\n{mval_microlabs}")
print(f"microclimate_sensor_readings:\n{mval_microclimate}")
print(f"soil_sensor_data:\n{mval_soil_sensor}")
print(f"microclimate_sensor_locations:\n{mval_microlabs_locations}")
microlabs_co2_sensor_data:
dev_id         0
sensor_name    0
time           0
temperature    0
humidity       0
light          0
motion         0
visit          0
vdd            0
lat_long       8
co2            0
latitude       0
longitude      0
date           0
dtype: int64
microclimate_sensor_readings:
local_time       0
id               0
site_id          0
sensor_id        0
value            0
type             0
units            0
gatewayhub_id    0
site_status      0
dtype: int64
soil_sensor_data:
local_time          0
site_name_x         0
site_id             0
id                  0
probe_id            0
probe_measure       0
soil_value          0
unit                0
json_featuretype    0
site_name_y         0
latitude            0
longitude           0
location            0
time_diff           0
dtype: int64
microclimate_sensor_locations:
site_id          0
gatewayhub_id    0
site_status      0
start_reading    0
latitude         0
longitude        0
location         0
dtype: int64
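The repeated isnull() checks can be collapsed into one loop over a name-to-DataFrame mapping, which also keeps the printed labels in sync with the dataset being reported. A sketch using small stand-in frames (the notebook would pass its real DataFrames):

```python
import numpy as np
import pandas as pd

# Stand-ins for the notebook's DataFrames (invented values)
datasets = {
    'microlabs_co2_sensor_data': pd.DataFrame({'co2': [400.0, np.nan],
                                               'lat_long': ['a', None]}),
    'microclimate_sensor_readings': pd.DataFrame({'value': [1.0, 2.0]}),
}

# One pass prints a correctly labelled missing-value report per dataset
for name, frame in datasets.items():
    print(f"{name}:\n{frame.isnull().sum()}\n")
```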
All dataset info¶
- df_soil_sensor_data
- microlabs_co2_sensor_data
- microclimate_sensor_readings
# View data types
microlabs_co2_sensor_data.info()
microclimate_sensor_readings.info()
microclimate_sensor_locations.info()
df_soil_sensor_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207521 entries, 0 to 207520
Data columns (total 14 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   dev_id       207521 non-null  object
 1   sensor_name  207521 non-null  object
 2   time         207521 non-null  datetime64[ns, UTC]
 3   temperature  207521 non-null  float64
 4   humidity     207521 non-null  int64
 5   light        207521 non-null  float64
 6   motion       207521 non-null  int64
 7   visit        207521 non-null  int64
 8   vdd          207521 non-null  int64
 9   lat_long     207513 non-null  object
 10  co2          207521 non-null  float64
 11  latitude     207521 non-null  object
 12  longitude    207521 non-null  object
 13  date         207521 non-null  object
dtypes: datetime64[ns, UTC](1), float64(3), int64(4), object(6)
memory usage: 22.2+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56 entries, 0 to 55
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   local_time     56 non-null     datetime64[ns, UTC]
 1   id             56 non-null     int64
 2   site_id        56 non-null     int64
 3   sensor_id      56 non-null     object
 4   value          56 non-null     float64
 5   type           56 non-null     object
 6   units          56 non-null     object
 7   gatewayhub_id  56 non-null     object
 8   site_status    56 non-null     object
dtypes: datetime64[ns, UTC](1), float64(1), int64(2), object(5)
memory usage: 4.1+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   site_id        16 non-null     int64
 1   gatewayhub_id  16 non-null     object
 2   site_status    16 non-null     object
 3   start_reading  16 non-null     datetime64[ns]
 4   latitude       16 non-null     float64
 5   longitude      16 non-null     float64
 6   location       16 non-null     object
dtypes: datetime64[ns](1), float64(2), int64(1), object(3)
memory usage: 1.0+ KB
<class 'pandas.core.frame.DataFrame'>
Index: 1184290 entries, 2205917 to 72856
Data columns (total 14 columns):
 #   Column            Non-Null Count    Dtype
---  ------            --------------    -----
 0   local_time        1184290 non-null  datetime64[ns]
 1   site_name_x       1184290 non-null  object
 2   site_id           1184290 non-null  int64
 3   id                1184290 non-null  int64
 4   probe_id          1184290 non-null  int64
 5   probe_measure     1184290 non-null  object
 6   soil_value        1184290 non-null  float64
 7   unit              1184290 non-null  object
 8   json_featuretype  1184290 non-null  object
 9   site_name_y       1184290 non-null  object
 10  latitude          1184290 non-null  float64
 11  longitude         1184290 non-null  float64
 12  location          1184290 non-null  object
 13  time_diff         1184290 non-null  timedelta64[ns]
dtypes: datetime64[ns](1), float64(3), int64(3), object(6), timedelta64[ns](1)
memory usage: 135.5+ MB
# Drop the time_diff helper column
df_soil_sensor_data.drop(['time_diff'], axis=1, inplace=True)
df_soil_sensor_data.head()
| local_time | site_name_x | site_id | id | probe_id | probe_measure | soil_value | unit | json_featuretype | site_name_y | latitude | longitude | location | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2205917 | 2022-12-31 13:00:00 | Carlton Gardens North | 64979 | 15522826 | 1306504 | Soil Moisture 50cm #0 | 32.63 | %VWC | SoilSensorData | Carlton Gardens North | -37.80170 | 144.97138 | -37.801699999983285, 144.9713799999935 |
| 898892 | 2022-12-31 13:00:00 | Alexandra Gardens Engineers Lawn | 66195 | 15522987 | 1338740 | Soil Temperature 50cm #0 | 21.36 | ºC | SoilSensorData | Alexandra Gardens Engineers Lawn | -37.82073 | 144.97254 | -37.82072999998553, 144.97253999999268 |
| 2248612 | 2022-12-31 13:00:00 | 9th Fairway | 64976 | 15515162 | 1320442 | Soil Salinity 20cm #0 | 0.05 | µS/cm | SoilSensorData | 9th Fairway | -37.78292 | 144.95566 | -37.78291999998904, 144.95566000000628 |
| 575652 | 2022-12-31 13:00:00 | Shrine Reserve North East | 64986 | 15521623 | 1305387 | Soil Moisture 50cm #0 | 51.39 | %VWC | SoilSensorData | Shrine Reserve North East | -37.82921 | 144.97348 | -37.8292099999859, 144.97347999998564 |
| 2205914 | 2022-12-31 13:00:00 | Carlton Gardens North | 64979 | 15514301 | 1306520 | Soil Temperature 30cm #0 | 20.63 | ºC | SoilSensorData | Carlton Gardens North | -37.80170 | 144.97138 | -37.801699999983285, 144.9713799999935 |
# Unique records df_soil_sensor_data
unique_dev_keys_in_soil_sensor_data = df_soil_sensor_data['id'].nunique()
print(f"Number of unique 'id' values in df_soil_sensor_data: {unique_dev_keys_in_soil_sensor_data}")
total_data_soil_sensor_data = df_soil_sensor_data.shape[0]
print(f"Total number of elements in df_soil_sensor_data: {total_data_soil_sensor_data}")
Number of unique 'id' values in df_soil_sensor_data: 1184290
Total number of elements in df_soil_sensor_data: 1184290
# Unique records microclimate_sensor_readings
unique_dev_microclimate_sensor_readings = microclimate_sensor_readings['id'].nunique()
print(f"Number of unique 'id' values in unique_dev_microclimate_sensor_readings: {unique_dev_microclimate_sensor_readings}")
total_data_microclimate_sensor_readings = microclimate_sensor_readings.shape[0]
print(f"Total number of elements in unique_dev_microclimate_sensor_readings: {total_data_microclimate_sensor_readings}")
Number of unique 'id' values in unique_dev_microclimate_sensor_readings: 56
Total number of elements in unique_dev_microclimate_sensor_readings: 56
# Unique records microlabs_co2_sensor_data
unique_dev_microlabs_co2_sensor_data = microlabs_co2_sensor_data['dev_id'].nunique()
print(f"Number of unique 'dev_id' values in microlabs_co2_sensor_data: {unique_dev_microlabs_co2_sensor_data}")
total_data_microlabs_co2_sensor_data = microlabs_co2_sensor_data.shape[0]
print(f"Total number of elements in microlabs_co2_sensor_data: {total_data_microlabs_co2_sensor_data}")
Number of unique 'dev_id' values in microlabs_co2_sensor_data: 1
Total number of elements in microlabs_co2_sensor_data: 207521
# Reset index for each row
microlabs_co2_sensor_data = microlabs_co2_sensor_data.reset_index(drop=True)
# Unique records microclimate_sensor_locations
unique_dev_microclimate_sensor_locations = microclimate_sensor_locations['site_id'].nunique()
print(f"Number of unique 'site_id' values in microclimate_sensor_locations: {unique_dev_microclimate_sensor_locations}")
total_data_microclimate_sensor_locations = microclimate_sensor_locations.shape[0]
print(f"Total number of elements in microclimate_sensor_locations: {total_data_microclimate_sensor_locations}")
Number of unique 'site_id' values in microclimate_sensor_locations: 16
Total number of elements in microclimate_sensor_locations: 16
All datasets check duplicates¶
# Check duplicates
duplicates_co2 = microlabs_co2_sensor_data.duplicated().sum()
duplicates_microclimate = microclimate_sensor_readings.duplicated().sum()
duplicates_locations = microclimate_sensor_locations.duplicated().sum()
duplicates_soil = df_soil_sensor_data.duplicated().sum()
print("Number of duplicate rows in CO2 sensor data:", duplicates_co2)
print("Number of duplicate rows in microclimate sensor readings:", duplicates_microclimate)
print("Number of duplicate rows in microclimate sensor locations:", duplicates_locations)
print("Number of duplicate rows in soil sensor data:", duplicates_soil)
Number of duplicate rows in CO2 sensor data: 0
Number of duplicate rows in microclimate sensor readings: 0
Number of duplicate rows in microclimate sensor locations: 0
Number of duplicate rows in soil sensor data: 0
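duplicated() with no arguments only flags rows identical in every column. If a key such as id is supposed to be unique, checking duplicated(subset=[...]) on that key can additionally catch logical duplicates that differ only in a derived column. A toy illustration (invented rows):

```python
import pandas as pd

# Toy frame: rows 0 and 2 share an id but differ in a derived column
df = pd.DataFrame({'id': [1, 2, 1],
                   'soil_value': [22.3, 21.0, 22.3],
                   'note': ['a', 'b', 'c']})

print(df.duplicated().sum())               # 0: no fully identical rows
print(df.duplicated(subset=['id']).sum())  # 1: id 1 appears twice
```

Here the unique-'id' counts computed earlier already confirm key uniqueness for the soil data, so the full-row check is sufficient.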
Merge Datasets microclimate_sensor_readings microclimate_sensor_locations¶
- microclimate_sensor_readings
- microclimate_sensor_locations
microclimate_sensor = pd.merge(microclimate_sensor_locations, microclimate_sensor_readings, on='site_id', how='outer')
microclimate_sensor = microclimate_sensor.dropna(subset=['id'])  # Drop location rows with no matching readings
microclimate_sensor.head()
| site_id | gatewayhub_id_x | site_status_x | start_reading | latitude | longitude | location | local_time | id | sensor_id | value | type | units | gatewayhub_id_y | site_status_y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1011 | arc1055 | C | 2021-07-07 | -37.822222 | 144.952222 | -37.822222, 144.952222 | 2023-11-29 22:15:02+00:00 | 8110553.0 | 5b | 74.38 | TPH.RH | % | arc1055 | C |
| 3 | 1011 | arc1055 | C | 2021-07-07 | -37.822222 | 144.952222 | -37.822222, 144.952222 | 2023-11-29 22:15:02+00:00 | 8110558.0 | 5c | 1010.80 | TPH.PRESSURE | hPa | arc1055 | C |
| 4 | 1011 | arc1055 | C | 2021-07-07 | -37.822222 | 144.952222 | -37.822222, 144.952222 | 2023-11-29 22:15:03+00:00 | 8110562.0 | 6 | 0.00 | WS | km/h | arc1055 | C |
| 5 | 1011 | arc1055 | C | 2021-07-07 | -37.822222 | 144.952222 | -37.822222, 144.952222 | 2023-11-29 22:30:02+00:00 | 8110554.0 | 5b | 74.19 | TPH.RH | % | arc1055 | C |
| 6 | 1011 | arc1055 | C | 2021-07-07 | -37.822222 | 144.952222 | -37.822222, 144.952222 | 2023-11-29 22:30:11+00:00 | 8110523.0 | 0a | 5.90 | PM2.5 | ug\/m^3 | arc1055 | C |
number_of_records = microclimate_sensor.shape[0] # Num records
print(f"Number of records in microclimate_sensor: {number_of_records}")
Number of records in microclimate_sensor: 56
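Merging with how='outer' and then dropping rows whose id is NaN is, provided the readings themselves never have a missing id, equivalent to an inner merge on site_id; how='inner' states that intent directly and skips the intermediate NaN rows. A toy comparison with invented site_ids and values:

```python
import pandas as pd

# Toy stand-ins: three sites, readings for only two of them
locations = pd.DataFrame({'site_id': [1001, 1002, 1003],
                          'latitude': [-37.82, -37.81, -37.80]})
readings = pd.DataFrame({'site_id': [1001, 1001, 1002],
                         'id': [10, 11, 12],
                         'value': [74.4, 74.2, 5.9]})

# Outer merge then drop unmatched locations vs. a direct inner merge
outer_then_drop = pd.merge(locations, readings, on='site_id', how='outer').dropna(subset=['id'])
inner = pd.merge(locations, readings, on='site_id', how='inner')

print(len(outer_then_drop), len(inner))  # both keep the 3 matched rows
```

One practical difference: the outer merge upcasts id to float (to hold NaN), whereas the inner merge keeps it integer.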
microlabs_co2_sensor_data.head()
| dev_id | sensor_name | time | temperature | humidity | light | motion | visit | vdd | lat_long | co2 | latitude | longitude | date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ers-co2-9da0 | ERS CO2 | 2022-05-04 01:40:25+00:00 | 20.9 | 47 | 441.0 | 0 | 0 | 3601 | -37.8133183, 144.9668853 | 507.0 | -37.8133183 | 144.9668853 | 2022-05-04 |
| 1 | ers-co2-9da0 | ERS CO2 | 2022-05-04 01:33:25+00:00 | 20.9 | 47 | 505.0 | 0 | 0 | 3599 | NaN | 525.0 | -37.8133183 | 144.9668853 | 2022-05-04 |
| 2 | ers-co2-9da0 | ERS CO2 | 2022-05-04 01:34:35+00:00 | 20.9 | 47 | 528.0 | 0 | 0 | 3601 | NaN | 527.0 | -37.8133183 | 144.9668853 | 2022-05-04 |
| 3 | ers-co2-9da0 | ERS CO2 | 2022-05-04 01:31:05+00:00 | 20.9 | 47 | 441.0 | 0 | 0 | 3596 | NaN | 535.0 | -37.8133183 | 144.9668853 | 2022-05-04 |
| 4 | ers-co2-9da0 | ERS CO2 | 2022-05-04 08:25:14+00:00 | 20.8 | 46 | 90.0 | 0 | 0 | 3596 | -37.8133183, 144.9668853 | 495.0 | -37.8133183 | 144.9668853 | 2022-05-04 |
microclimate_sensor.head()
| site_id | gatewayhub_id_x | site_status_x | start_reading | latitude | longitude | location | local_time | id | sensor_id | value | type | units | gatewayhub_id_y | site_status_y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1011 | arc1055 | C | 2021-07-07 | -37.822222 | 144.952222 | -37.822222, 144.952222 | 2023-11-29 22:15:02+00:00 | 8110553.0 | 5b | 74.38 | TPH.RH | % | arc1055 | C |
| 3 | 1011 | arc1055 | C | 2021-07-07 | -37.822222 | 144.952222 | -37.822222, 144.952222 | 2023-11-29 22:15:02+00:00 | 8110558.0 | 5c | 1010.80 | TPH.PRESSURE | hPa | arc1055 | C |
| 4 | 1011 | arc1055 | C | 2021-07-07 | -37.822222 | 144.952222 | -37.822222, 144.952222 | 2023-11-29 22:15:03+00:00 | 8110562.0 | 6 | 0.00 | WS | km/h | arc1055 | C |
| 5 | 1011 | arc1055 | C | 2021-07-07 | -37.822222 | 144.952222 | -37.822222, 144.952222 | 2023-11-29 22:30:02+00:00 | 8110554.0 | 5b | 74.19 | TPH.RH | % | arc1055 | C |
| 6 | 1011 | arc1055 | C | 2021-07-07 | -37.822222 | 144.952222 | -37.822222, 144.952222 | 2023-11-29 22:30:11+00:00 | 8110523.0 | 0a | 5.90 | PM2.5 | ug\/m^3 | arc1055 | C |
df_soil_sensor_data.head()
| local_time | site_name_x | site_id | id | probe_id | probe_measure | soil_value | unit | json_featuretype | site_name_y | latitude | longitude | location | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2205917 | 2022-12-31 13:00:00 | Carlton Gardens North | 64979 | 15522826 | 1306504 | Soil Moisture 50cm #0 | 32.63 | %VWC | SoilSensorData | Carlton Gardens North | -37.80170 | 144.97138 | -37.801699999983285, 144.9713799999935 |
| 898892 | 2022-12-31 13:00:00 | Alexandra Gardens Engineers Lawn | 66195 | 15522987 | 1338740 | Soil Temperature 50cm #0 | 21.36 | ºC | SoilSensorData | Alexandra Gardens Engineers Lawn | -37.82073 | 144.97254 | -37.82072999998553, 144.97253999999268 |
| 2248612 | 2022-12-31 13:00:00 | 9th Fairway | 64976 | 15515162 | 1320442 | Soil Salinity 20cm #0 | 0.05 | µS/cm | SoilSensorData | 9th Fairway | -37.78292 | 144.95566 | -37.78291999998904, 144.95566000000628 |
| 575652 | 2022-12-31 13:00:00 | Shrine Reserve North East | 64986 | 15521623 | 1305387 | Soil Moisture 50cm #0 | 51.39 | %VWC | SoilSensorData | Shrine Reserve North East | -37.82921 | 144.97348 | -37.8292099999859, 144.97347999998564 |
| 2205914 | 2022-12-31 13:00:00 | Carlton Gardens North | 64979 | 15514301 | 1306520 | Soil Temperature 30cm #0 | 20.63 | ºC | SoilSensorData | Carlton Gardens North | -37.80170 | 144.97138 | -37.801699999983285, 144.9713799999935 |
A. Save Datasets¶
- microclimate_sensor
- microlabs_co2_sensor_data
- df_soil_sensor_data
# Save the cleaned DataFrames to CSV files
microclimate_sensor.to_csv('/content/drive/My Drive/sit378_eco_metrix/microclimate_sensor_cleaned.csv', index=False)
microlabs_co2_sensor_data.to_csv('/content/drive/My Drive/sit378_eco_metrix/microlabs_co2_sensor_data_cleaned.csv', index=False)
df_soil_sensor_data.to_csv('/content/drive/My Drive/sit378_eco_metrix/df_soil_sensor_data_cleaned.csv', index=False)
User Interface - View Datasets¶
B. Load datasets¶
- microclimate_sensor
- microlabs_co2_sensor_data
- df_soil_sensor_data
# load datasets
microclimate_sensor_df = pd.read_csv('/content/drive/My Drive/sit378_eco_metrix/microclimate_sensor_cleaned.csv')
microlabs_co2_sensor_data_df = pd.read_csv('/content/drive/My Drive/sit378_eco_metrix/microlabs_co2_sensor_data_cleaned.csv')
df_soil_sensor_data_df = pd.read_csv('/content/drive/My Drive/sit378_eco_metrix/df_soil_sensor_data_cleaned.csv')
microclimate_sensor = pd.read_csv('/content/drive/My Drive/sit378_eco_metrix/microclimate_sensor_cleaned.csv')
# Dictionary of display names to DataFrames
dataframes = {
    'Microclimate Sensor': microclimate_sensor_df,
    'Microlabs CO2 Sensor Data': microlabs_co2_sensor_data_df,
    'Soil Sensor Data': df_soil_sensor_data_df
}
# Dropdown widget for selecting a dataset
dataset_dropdown = widgets.Dropdown(
    options=list(dataframes.keys()),
    description='Dataset:',
    disabled=False,
)
# Update the displayed data when the dropdown selection changes
def dataset_change(change):
    selected_dataset = change['new']
    display(dataframes[selected_dataset])
# Watch for changes on the dataset dropdown
dataset_dropdown.observe(dataset_change, names='value')
# Display the widget
display(dataset_dropdown)
Dropdown(description='Dataset:', options=('Microclimate Sensor', 'Microlabs CO2 Sensor Data', 'Soil Sensor Dat…
| dev_id | sensor_name | time | temperature | humidity | light | motion | visit | vdd | lat_long | co2 | latitude | longitude | date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ers-co2-9da0 | ERS CO2 | 2022-05-04 01:40:25+00:00 | 20.9 | 47 | 441.0 | 0 | 0 | 3601 | -37.8133183, 144.9668853 | 507.0 | -37.813318 | 144.966885 | 2022-05-04 |
| 1 | ers-co2-9da0 | ERS CO2 | 2022-05-04 01:33:25+00:00 | 20.9 | 47 | 505.0 | 0 | 0 | 3599 | NaN | 525.0 | -37.813318 | 144.966885 | 2022-05-04 |
| 2 | ers-co2-9da0 | ERS CO2 | 2022-05-04 01:34:35+00:00 | 20.9 | 47 | 528.0 | 0 | 0 | 3601 | NaN | 527.0 | -37.813318 | 144.966885 | 2022-05-04 |
| 3 | ers-co2-9da0 | ERS CO2 | 2022-05-04 01:31:05+00:00 | 20.9 | 47 | 441.0 | 0 | 0 | 3596 | NaN | 535.0 | -37.813318 | 144.966885 | 2022-05-04 |
| 4 | ers-co2-9da0 | ERS CO2 | 2022-05-04 08:25:14+00:00 | 20.8 | 46 | 90.0 | 0 | 0 | 3596 | -37.8133183, 144.9668853 | 495.0 | -37.813318 | 144.966885 | 2022-05-04 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 207516 | ers-co2-9da0 | ERS CO2 | 2024-04-02 22:52:51+00:00 | 20.1 | 53 | 377.0 | 2 | 1 | 3651 | -37.8133183, 144.9668853 | 425.0 | -37.813318 | 144.966885 | 2024-04-02 |
| 207517 | ers-co2-9da0 | ERS CO2 | 2024-04-02 23:07:51+00:00 | 21.2 | 52 | 394.0 | 0 | 0 | 3648 | -37.8133183, 144.9668853 | 417.0 | -37.813318 | 144.966885 | 2024-04-02 |
| 207518 | ers-co2-9da0 | ERS CO2 | 2024-04-02 23:37:51+00:00 | 19.6 | 57 | 309.0 | 0 | 0 | 3648 | -37.8133183, 144.9668853 | 410.0 | -37.813318 | 144.966885 | 2024-04-02 |
| 207519 | ers-co2-9da0 | ERS CO2 | 2024-04-02 23:47:51+00:00 | 21.9 | 51 | 400.0 | 4 | 1 | 3648 | -37.8133183, 144.9668853 | 415.0 | -37.813318 | 144.966885 | 2024-04-02 |
| 207520 | ers-co2-9da0 | ERS CO2 | 2024-04-02 23:52:50+00:00 | 21.2 | 52 | 478.0 | 1 | 1 | 3648 | -37.8133183, 144.9668853 | 413.0 | -37.813318 | 144.966885 | 2024-04-02 |
207521 rows × 14 columns
Soil Data Pre-Processing¶
- Split local_time into separate date and time columns
# Extract date and Drop columns
df_soil_sensor_data = df_soil_sensor_data_df.copy()
df_soil_sensor_data['time'] = pd.to_datetime(df_soil_sensor_data['local_time'])
df_soil_sensor_data['date'] = df_soil_sensor_data['time'].dt.date
df_soil_sensor_data = df_soil_sensor_data.drop(columns=['local_time', 'site_name_x', 'probe_id', 'json_featuretype', 'site_name_y'])
# Keep only the time-of-day component ('time' is already datetime)
df_soil_sensor_data['time'] = df_soil_sensor_data['time'].dt.time
df_soil_sensor_data.head()
| site_id | id | probe_measure | soil_value | unit | latitude | longitude | location | time | date | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 64979 | 15522826 | Soil Moisture 50cm #0 | 32.63 | %VWC | -37.80170 | 144.97138 | -37.801699999983285, 144.9713799999935 | 13:00:00 | 2022-12-31 |
| 1 | 66195 | 15522987 | Soil Temperature 50cm #0 | 21.36 | ºC | -37.82073 | 144.97254 | -37.82072999998553, 144.97253999999268 | 13:00:00 | 2022-12-31 |
| 2 | 64976 | 15515162 | Soil Salinity 20cm #0 | 0.05 | µS/cm | -37.78292 | 144.95566 | -37.78291999998904, 144.95566000000628 | 13:00:00 | 2022-12-31 |
| 3 | 64986 | 15521623 | Soil Moisture 50cm #0 | 51.39 | %VWC | -37.82921 | 144.97348 | -37.8292099999859, 144.97347999998564 | 13:00:00 | 2022-12-31 |
| 4 | 64979 | 15514301 | Soil Temperature 30cm #0 | 20.63 | ºC | -37.80170 | 144.97138 | -37.801699999983285, 144.9713799999935 | 13:00:00 | 2022-12-31 |
# Extract date and Drop columns
microlabs_co2_sensor_data = microlabs_co2_sensor_data_df.copy()
microlabs_co2_sensor_data['time'] = pd.to_datetime(microlabs_co2_sensor_data['time'])
microlabs_co2_sensor_data['date'] = microlabs_co2_sensor_data['time'].dt.date
microlabs_co2_sensor_data = microlabs_co2_sensor_data.drop(columns=['sensor_name'])
# Single fixed sensor (one dev_id): assign its known location to every row,
# which also fills the 8 missing lat_long values
microlabs_co2_sensor_data['lat_long'] = '-37.8133183, 144.9668853'
# Keep only the time-of-day component ('time' is already datetime)
microlabs_co2_sensor_data['time'] = microlabs_co2_sensor_data['time'].dt.time
microlabs_co2_sensor_data.head()
| dev_id | time | temperature | humidity | light | motion | visit | vdd | lat_long | co2 | latitude | longitude | date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ers-co2-9da0 | 01:40:25 | 20.9 | 47 | 441.0 | 0 | 0 | 3601 | -37.8133183, 144.9668853 | 507.0 | -37.813318 | 144.966885 | 2022-05-04 |
| 1 | ers-co2-9da0 | 01:33:25 | 20.9 | 47 | 505.0 | 0 | 0 | 3599 | -37.8133183, 144.9668853 | 525.0 | -37.813318 | 144.966885 | 2022-05-04 |
| 2 | ers-co2-9da0 | 01:34:35 | 20.9 | 47 | 528.0 | 0 | 0 | 3601 | -37.8133183, 144.9668853 | 527.0 | -37.813318 | 144.966885 | 2022-05-04 |
| 3 | ers-co2-9da0 | 01:31:05 | 20.9 | 47 | 441.0 | 0 | 0 | 3596 | -37.8133183, 144.9668853 | 535.0 | -37.813318 | 144.966885 | 2022-05-04 |
| 4 | ers-co2-9da0 | 08:25:14 | 20.8 | 46 | 90.0 | 0 | 0 | 3596 | -37.8133183, 144.9668853 | 495.0 | -37.813318 | 144.966885 | 2022-05-04 |
Merge Datasets¶
- Using Date and Time
- Set datetime and then update index
- Sort data and merge using 1 hour intervals
- Normalise the data
# Recombine date and time into a single datetime column
# (convert both to strings before concatenating)
df_soil_sensor_data['datetime'] = pd.to_datetime(df_soil_sensor_data['date'].astype(str) + ' ' + df_soil_sensor_data['time'].astype(str))
microlabs_co2_sensor_data['datetime'] = pd.to_datetime(microlabs_co2_sensor_data['date'].astype(str) + ' ' + microlabs_co2_sensor_data['time'].astype(str))
# index update
df_soil_sensor_data.set_index('datetime', inplace=True)
microlabs_co2_sensor_data.set_index('datetime', inplace=True)
# Sort
df_soil_sensor_data.sort_index(inplace=True)
microlabs_co2_sensor_data.sort_index(inplace=True)
# As-of merge: pair each soil reading with the nearest CO2 reading within 1 hour
soil_co2_sensor_data = pd.merge_asof(df_soil_sensor_data, microlabs_co2_sensor_data, on='datetime', tolerance=pd.Timedelta('1H'), direction='nearest')
print(soil_co2_sensor_data.head())
datetime site_id id probe_measure \
0 2022-12-31 13:00:00 64979 15522826 Soil Moisture 50cm #0
1 2022-12-31 13:00:00 66195 15522987 Soil Temperature 50cm #0
2 2022-12-31 13:00:00 64976 15515162 Soil Salinity 20cm #0
3 2022-12-31 13:00:00 64986 15521623 Soil Moisture 50cm #0
4 2022-12-31 13:00:00 64979 15514301 Soil Temperature 30cm #0
soil_value unit latitude_x longitude_x \
0 32.63 %VWC -37.80170 144.97138
1 21.36 ºC -37.82073 144.97254
2 0.05 µS/cm -37.78292 144.95566
3 51.39 %VWC -37.82921 144.97348
4 20.63 ºC -37.80170 144.97138
location time_x date_x dev_id \
0 -37.801699999983285, 144.9713799999935 13:00:00 2022-12-31 ers-co2-9da0
1 -37.82072999998553, 144.97253999999268 13:00:00 2022-12-31 ers-co2-9da0
2 -37.78291999998904, 144.95566000000628 13:00:00 2022-12-31 ers-co2-9da0
3 -37.8292099999859, 144.97347999998564 13:00:00 2022-12-31 ers-co2-9da0
4 -37.801699999983285, 144.9713799999935 13:00:00 2022-12-31 ers-co2-9da0
time_y temperature humidity light motion visit vdd \
0 13:00:46 24.5 61 0.0 0 0 3656
1 13:00:46 24.5 61 0.0 0 0 3656
2 13:00:46 24.5 61 0.0 0 0 3656
3 13:00:46 24.5 61 0.0 0 0 3656
4 13:00:46 24.5 61 0.0 0 0 3656
lat_long co2 latitude_y longitude_y date_y
0 -37.8133183, 144.9668853 416.0 -37.813318 144.966885 2022-12-31
1 -37.8133183, 144.9668853 416.0 -37.813318 144.966885 2022-12-31
2 -37.8133183, 144.9668853 416.0 -37.813318 144.966885 2022-12-31
3 -37.8133183, 144.9668853 416.0 -37.813318 144.966885 2022-12-31
4 -37.8133183, 144.9668853 416.0 -37.813318 144.966885 2022-12-31
Separate the values (units)¶
- %VWC: Volumetric water content, a numerical measure of soil moisture: the ratio of water volume to soil volume.
- ºC: Temperature in degrees Celsius.
- µS/cm: Salinity, the concentration of dissolved (soluble) salts in water from all sources (microsiemens per centimetre).
- NaN values
- Zero if a missing reading represents no measurement or an expected value of zero.
- Mean (or median) of the column if readings are missing at random; the median is more robust when the distribution is skewed.
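The two fill strategies in the bullets can be sketched on a toy column (hypothetical values):

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0, None])

# Strategy 1: zero, when a missing reading genuinely means "nothing measured"
filled_zero = s.fillna(0)

# Strategy 2: column mean, when readings are missing at random
filled_mean = s.fillna(s.mean())  # mean of [1.0, 3.0] is 2.0

print(filled_zero.tolist())  # [1.0, 0.0, 3.0, 0.0]
print(filled_mean.tolist())  # [1.0, 2.0, 3.0, 2.0]
```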
print(soil_co2_sensor_data['probe_measure'].unique())
['Soil Moisture 50cm #0' 'Soil Temperature 50cm #0' 'Soil Salinity 20cm #0' 'Soil Temperature 30cm #0' 'Soil Moisture 40cm #0 (Clay)' 'Soil Moisture 40cm #0' 'Soil Salinity 30cm #0' 'Soil Moisture 40cm #0 (Sand)' 'Soil Moisture 20cm #0 (Sand)' 'Soil Moisture 10cm #0 (Sand)' 'Soil Temperature 10cm #0' 'Soil Temperature 40cm #0' 'Soil Moisture 10cm #0' 'Soil Moisture 80cm #0 (Clay)' 'Soil Moisture 30cm #0 (Clay)' 'Soil Moisture 80cm #0' 'Soil Temperature 20cm #0' 'Soil Moisture 20cm #0 (Clayey Loam)' 'Soil Moisture 30cm #0 (Clayey Loam)' 'Soil Moisture 70cm #0 (Clay)' 'Soil Salinity 10cm #0' 'Soil Salinity 40 cm #0' '#0 Soil Moisture 50cm' 'Soil Moisture 50cm #0 (Clay)' 'Soil Moisture 30cm #0' 'Soil Temperature 80cm #0' '#0 Soil Moisture 60cm' '#0 Soil Moisture 80cm' '#0 Soil Moisture 20cm' 'Soil Moisture 20cm #0' 'Soil Temperature 70cm #0' 'Soil Moisture 30cm #0 (Sandy Loam)' 'Soil Moisture 40cm #0 (Clayey Loam)' 'Soil Moisture 10cm #0 (Clayey Loam)' 'Soil Salinity 40cm #0' '#0 Soil Moisture 70cm' '#0 Soil Moisture 30cm' '#0 Soil Moisture 10cm' 'Soil Salinity 60cm #0' 'Soil Moisture 60cm #0' 'Soil Moisture 50cm #0 (Sandy Loam)' 'Soil Moisture 70cm #0' 'Soil Salinity 50cm #0' 'Soil Moisture 80cm #0 (Sandy Loam)' 'Soil Moisture 70cm #0 (Sandy Loam)' 'Soil Temperature 60cm #0' 'Soil Moisture 60cm #0 (Clay)' 'Soil Salinity 70cm #0' '#0 Soil Moisture 10cm Clay Loam' 'Soil Moisture 10cm #0 (Sandy Loam)' 'Soil Salinity 80cm #0' 'Soil Moisture 20cm #0 (Sandy Loam)' 'Soil Moisture 60cm #0 (Loam)' 'Soil Moisture 50cm #0 (Loam)' 'Soil Moisture 40cm #0 (Loam)' '#0 Soil Moisture 40cm' 'Soil Moisture 70cm #0 (Loam)' 'Soil Moisture 80cm #0 (Loam)' 'Soil Moisture 30cm #0 (Sand)' 'Soil Moisture 40cm #0 (Sandy Loam)' 'Soil Moisture 60cm #0 (Sandy Loam)']
# Function to pivot the data for one measurement unit
def pivot_by_unit(dataframe, unit, fill_value=None):
    filtered_df = dataframe[dataframe['unit'] == unit]
    pivot = filtered_df.pivot_table(
        index=['datetime', 'site_id'],
        columns='probe_measure',
        values='soil_value',
        aggfunc='first'
    ).reset_index()
    # Fill NaNs selectively
    if fill_value is not None:
        pivot.fillna(fill_value, inplace=True)
    else:  # default: fill numeric columns with their mean
        numeric_cols = pivot.select_dtypes(include=['number']).columns
        pivot[numeric_cols] = pivot[numeric_cols].fillna(pivot[numeric_cols].mean())
    return pivot
# Apply function to each unit
df_vwc = pivot_by_unit(soil_co2_sensor_data, '%VWC')
df_temp = pivot_by_unit(soil_co2_sensor_data, 'ºC')
df_salinity = pivot_by_unit(soil_co2_sensor_data, 'µS/cm')
# Display the pivoted data
print("VWC Data:")
print(df_vwc)
print("\nTemperature Data:")
print(df_temp)
print("\nSalinity Data:")
print(df_salinity)
[Output truncated: three pivoted tables, each keyed by datetime and site_id and spanning 2022-12-31 13:00:00 to 2024-04-01 21:00:00]
VWC Data: 45329 rows × 46 columns (one column per soil moisture probe measure, e.g. 'Soil Moisture 10cm #0 (Sand)')
Temperature Data: 45288 rows × 10 columns (soil temperature at 10–80cm depths)
Salinity Data: 37079 rows × 11 columns (soil salinity at 10–80cm depths)
Correlation Heatmap¶
- Volumetric Water Content (VWC), Temperature and Salinity
data_frames = [df_vwc, df_temp, df_salinity]
df_merged = pd.concat(data_frames, axis=1)
# Drop duplicated columns ('datetime' and 'site_id' appear in each frame)
df_merged = df_merged.loc[:, ~df_merged.columns.duplicated()]
# Correlation matrix over the numeric sensor columns only
correlation_matrix = df_merged.corr(numeric_only=True)
# Plotting the correlation matrix without annotations
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, cmap='coolwarm', cbar=True)
plt.title("Correlation Matrix of VWC, Temperature, and Salinity")
plt.show()
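The duplicate-column drop used before the heatmap can be sketched on two toy frames (hypothetical column names):

```python
import pandas as pd

a = pd.DataFrame({"datetime": [1, 2], "vwc": [0.3, 0.4]})
b = pd.DataFrame({"datetime": [1, 2], "temp": [20.1, 20.5]})

merged = pd.concat([a, b], axis=1)  # 'datetime' now appears twice
# columns.duplicated() marks every repeat after the first occurrence
deduped = merged.loc[:, ~merged.columns.duplicated()]

print(list(deduped.columns))  # ['datetime', 'vwc', 'temp']
```

This keeps the first occurrence of each label, which is why the shared `datetime`/`site_id` columns survive the concat only once.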
Time series plot¶
print(soil_co2_sensor_data['site_id'].unique())
[64979 66195 64976 64986 64981 64984 64989 65002 64982 64998 75509 64991 75502 75507 65012 75503 64990 75500 66203 64977 64993 65005 64973 65014 64975 66190 66187 64999 64970 65010 64988 65011 64974 64996 64980 75508 64971 66192 65013 66198 64985 66204 75505 65007 66199 75504 64987]
print(soil_co2_sensor_data['probe_measure'].unique())
['Soil Moisture 50cm #0' 'Soil Temperature 50cm #0' 'Soil Salinity 20cm #0' 'Soil Temperature 30cm #0' 'Soil Moisture 40cm #0 (Clay)' 'Soil Moisture 40cm #0' 'Soil Salinity 30cm #0' 'Soil Moisture 40cm #0 (Sand)' 'Soil Moisture 20cm #0 (Sand)' 'Soil Moisture 10cm #0 (Sand)' 'Soil Temperature 10cm #0' 'Soil Temperature 40cm #0' 'Soil Moisture 10cm #0' 'Soil Moisture 80cm #0 (Clay)' 'Soil Moisture 30cm #0 (Clay)' 'Soil Moisture 80cm #0' 'Soil Temperature 20cm #0' 'Soil Moisture 20cm #0 (Clayey Loam)' 'Soil Moisture 30cm #0 (Clayey Loam)' 'Soil Moisture 70cm #0 (Clay)' 'Soil Salinity 10cm #0' 'Soil Salinity 40 cm #0' '#0 Soil Moisture 50cm' 'Soil Moisture 50cm #0 (Clay)' 'Soil Moisture 30cm #0' 'Soil Temperature 80cm #0' '#0 Soil Moisture 60cm' '#0 Soil Moisture 80cm' '#0 Soil Moisture 20cm' 'Soil Moisture 20cm #0' 'Soil Temperature 70cm #0' 'Soil Moisture 30cm #0 (Sandy Loam)' 'Soil Moisture 40cm #0 (Clayey Loam)' 'Soil Moisture 10cm #0 (Clayey Loam)' 'Soil Salinity 40cm #0' '#0 Soil Moisture 70cm' '#0 Soil Moisture 30cm' '#0 Soil Moisture 10cm' 'Soil Salinity 60cm #0' 'Soil Moisture 60cm #0' 'Soil Moisture 50cm #0 (Sandy Loam)' 'Soil Moisture 70cm #0' 'Soil Salinity 50cm #0' 'Soil Moisture 80cm #0 (Sandy Loam)' 'Soil Moisture 70cm #0 (Sandy Loam)' 'Soil Temperature 60cm #0' 'Soil Moisture 60cm #0 (Clay)' 'Soil Salinity 70cm #0' '#0 Soil Moisture 10cm Clay Loam' 'Soil Moisture 10cm #0 (Sandy Loam)' 'Soil Salinity 80cm #0' 'Soil Moisture 20cm #0 (Sandy Loam)' 'Soil Moisture 60cm #0 (Loam)' 'Soil Moisture 50cm #0 (Loam)' 'Soil Moisture 40cm #0 (Loam)' '#0 Soil Moisture 40cm' 'Soil Moisture 70cm #0 (Loam)' 'Soil Moisture 80cm #0 (Loam)' 'Soil Moisture 30cm #0 (Sand)' 'Soil Moisture 40cm #0 (Sandy Loam)' 'Soil Moisture 60cm #0 (Sandy Loam)']
# Categorise each probe_measure
probe_measures = soil_co2_sensor_data['probe_measure'].unique()
moisture_measures = []
temperature_measures = []
salinity_measures = []
depth_measures = set()
soil_type_measures = set()
for measure in probe_measures:
    # Depth info (e.g. '40cm') is common to all measure types
    depth = re.search(r'\d+cm', measure)
    if depth:
        depth_measures.add(depth.group())
    if 'Moisture' in measure:
        moisture_measures.append(measure)
        # Soil type appears in parentheses, e.g. '(Clay)'
        soil_type = re.search(r'\((.*?)\)', measure)
        if soil_type:
            soil_type_measures.add(soil_type.group(1))
    elif 'Temperature' in measure:
        temperature_measures.append(measure)
    elif 'Salinity' in measure:
        salinity_measures.append(measure)
print('Moisture Measures:', moisture_measures)
print('Temperature Measures:', temperature_measures)
print('Salinity Measures:', salinity_measures)
print('Depth Measures:', depth_measures)
print('Soil Type Measures:', soil_type_measures)
Moisture Measures: ['Soil Moisture 50cm #0', 'Soil Moisture 40cm #0 (Clay)', 'Soil Moisture 40cm #0', 'Soil Moisture 40cm #0 (Sand)', 'Soil Moisture 20cm #0 (Sand)', 'Soil Moisture 10cm #0 (Sand)', 'Soil Moisture 10cm #0', 'Soil Moisture 80cm #0 (Clay)', 'Soil Moisture 30cm #0 (Clay)', 'Soil Moisture 80cm #0', 'Soil Moisture 20cm #0 (Clayey Loam)', 'Soil Moisture 30cm #0 (Clayey Loam)', 'Soil Moisture 70cm #0 (Clay)', '#0 Soil Moisture 50cm', 'Soil Moisture 50cm #0 (Clay)', 'Soil Moisture 30cm #0', '#0 Soil Moisture 60cm', '#0 Soil Moisture 80cm', '#0 Soil Moisture 20cm', 'Soil Moisture 20cm #0', 'Soil Moisture 30cm #0 (Sandy Loam)', 'Soil Moisture 40cm #0 (Clayey Loam)', 'Soil Moisture 10cm #0 (Clayey Loam)', '#0 Soil Moisture 70cm', '#0 Soil Moisture 30cm', '#0 Soil Moisture 10cm', 'Soil Moisture 60cm #0', 'Soil Moisture 50cm #0 (Sandy Loam)', 'Soil Moisture 70cm #0', 'Soil Moisture 80cm #0 (Sandy Loam)', 'Soil Moisture 70cm #0 (Sandy Loam)', 'Soil Moisture 60cm #0 (Clay)', '#0 Soil Moisture 10cm Clay Loam', 'Soil Moisture 10cm #0 (Sandy Loam)', 'Soil Moisture 20cm #0 (Sandy Loam)', 'Soil Moisture 60cm #0 (Loam)', 'Soil Moisture 50cm #0 (Loam)', 'Soil Moisture 40cm #0 (Loam)', '#0 Soil Moisture 40cm', 'Soil Moisture 70cm #0 (Loam)', 'Soil Moisture 80cm #0 (Loam)', 'Soil Moisture 30cm #0 (Sand)', 'Soil Moisture 40cm #0 (Sandy Loam)', 'Soil Moisture 60cm #0 (Sandy Loam)']
Temperature Measures: ['Soil Temperature 50cm #0', 'Soil Temperature 30cm #0', 'Soil Temperature 10cm #0', 'Soil Temperature 40cm #0', 'Soil Temperature 20cm #0', 'Soil Temperature 80cm #0', 'Soil Temperature 70cm #0', 'Soil Temperature 60cm #0']
Salinity Measures: ['Soil Salinity 20cm #0', 'Soil Salinity 30cm #0', 'Soil Salinity 10cm #0', 'Soil Salinity 40 cm #0', 'Soil Salinity 40cm #0', 'Soil Salinity 60cm #0', 'Soil Salinity 50cm #0', 'Soil Salinity 70cm #0', 'Soil Salinity 80cm #0']
Depth Measures: {'10cm', '70cm', '50cm', '40cm', '30cm', '80cm', '60cm', '20cm'}
Soil Type Measures: {'Loam', 'Sand', 'Clay', 'Sandy Loam', 'Clayey Loam'}
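The categorisation above relies on two small regexes; on a sample label they behave like this:

```python
import re

label = "Soil Moisture 40cm #0 (Clayey Loam)"

depth = re.search(r"\d+cm", label)          # digits immediately followed by 'cm'
soil_type = re.search(r"\((.*?)\)", label)  # non-greedy text inside parentheses

print(depth.group())       # 40cm
print(soil_type.group(1))  # Clayey Loam
```

Note that irregular labels such as 'Soil Salinity 40 cm #0' (space before 'cm') and '#0 Soil Moisture 10cm Clay Loam' (no parentheses) slip past these patterns, which is why both variants appear in the source data and why an "other" bucket is needed below.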
# Group moisture measures by soil type
clay_measures = []
sand_measures = []
loam_measures = []
sandy_loam_measures = []
clayey_loam_measures = []
other_measures = []
for measure in moisture_measures:
    soil_type = re.search(r'\((.*?)\)', measure)
    if soil_type:
        soil_type_str = soil_type.group(1)
        if soil_type_str == 'Clay':
            clay_measures.append(measure)
        elif soil_type_str == 'Sand':
            sand_measures.append(measure)
        elif soil_type_str == 'Loam':
            loam_measures.append(measure)
        elif soil_type_str == 'Sandy Loam':
            sandy_loam_measures.append(measure)
        elif soil_type_str == 'Clayey Loam':
            clayey_loam_measures.append(measure)
    else:
        # No parenthesised soil type, e.g. '#0 Soil Moisture 10cm Clay Loam'
        other_measures.append(measure)
# One helper summarises each category instead of repeating the same block six times
def summarise_measures(label, measures):
    soil_types = list({m.group(1) for m in (re.search(r'\((.*?)\)', s) for s in measures) if m})
    depths = list({m.group() for m in (re.search(r'\d+cm', s) for s in measures) if m})
    print(f"\n{label} Measures:")
    if soil_types:
        print("Soil Types:", soil_types)
    print("Depths:", depths)
    print(measures)

for label, measures in [("Clay", clay_measures), ("Sand", sand_measures),
                        ("Loam", loam_measures), ("Sandy Loam", sandy_loam_measures),
                        ("Clayey Loam", clayey_loam_measures), ("Other", other_measures)]:
    summarise_measures(label, measures)
Clay Measures:
Soil Types: ['Clay']
Depths: ['70cm', '50cm', '40cm', '30cm', '80cm', '60cm']
['Soil Moisture 40cm #0 (Clay)', 'Soil Moisture 80cm #0 (Clay)', 'Soil Moisture 30cm #0 (Clay)', 'Soil Moisture 70cm #0 (Clay)', 'Soil Moisture 50cm #0 (Clay)', 'Soil Moisture 60cm #0 (Clay)']

Sand Measures:
Soil Types: ['Sand']
Depths: ['10cm', '40cm', '30cm', '20cm']
['Soil Moisture 40cm #0 (Sand)', 'Soil Moisture 20cm #0 (Sand)', 'Soil Moisture 10cm #0 (Sand)', 'Soil Moisture 30cm #0 (Sand)']

Loam Measures:
Soil Types: ['Loam']
Depths: ['70cm', '50cm', '40cm', '80cm', '60cm']
['Soil Moisture 60cm #0 (Loam)', 'Soil Moisture 50cm #0 (Loam)', 'Soil Moisture 40cm #0 (Loam)', 'Soil Moisture 70cm #0 (Loam)', 'Soil Moisture 80cm #0 (Loam)']

Sandy Loam Measures:
Soil Types: ['Sandy Loam']
Depths: ['10cm', '70cm', '50cm', '40cm', '30cm', '80cm', '60cm', '20cm']
['Soil Moisture 30cm #0 (Sandy Loam)', 'Soil Moisture 50cm #0 (Sandy Loam)', 'Soil Moisture 80cm #0 (Sandy Loam)', 'Soil Moisture 70cm #0 (Sandy Loam)', 'Soil Moisture 10cm #0 (Sandy Loam)', 'Soil Moisture 20cm #0 (Sandy Loam)', 'Soil Moisture 40cm #0 (Sandy Loam)', 'Soil Moisture 60cm #0 (Sandy Loam)']

Clayey Loam Measures:
Soil Types: ['Clayey Loam']
Depths: ['10cm', '40cm', '30cm', '20cm']
['Soil Moisture 20cm #0 (Clayey Loam)', 'Soil Moisture 30cm #0 (Clayey Loam)', 'Soil Moisture 40cm #0 (Clayey Loam)', 'Soil Moisture 10cm #0 (Clayey Loam)']

Other Measures:
Depths: ['10cm', '70cm', '50cm', '40cm', '30cm', '80cm', '60cm', '20cm']
['Soil Moisture 50cm #0', 'Soil Moisture 40cm #0', 'Soil Moisture 10cm #0', 'Soil Moisture 80cm #0', '#0 Soil Moisture 50cm', 'Soil Moisture 30cm #0', '#0 Soil Moisture 60cm', '#0 Soil Moisture 80cm', '#0 Soil Moisture 20cm', 'Soil Moisture 20cm #0', '#0 Soil Moisture 70cm', '#0 Soil Moisture 30cm', '#0 Soil Moisture 10cm', 'Soil Moisture 60cm #0', 'Soil Moisture 70cm #0', '#0 Soil Moisture 10cm Clay Loam', '#0 Soil Moisture 40cm']
Function for time series plots of soil_data¶
Monthly median value per probe measure & site_id¶
- Monthly median soil values for each probe measure and site
# Function for time series plots of monthly medians
def plot_filtered_medians(dataframe, measures, title):
    if not measures:
        print("No measures provided.")
        return
    filtered_data = dataframe[['datetime', 'site_id'] + measures]
    # Monthly median per site ('M' = month-end; renamed 'ME' in pandas >= 2.2)
    monthly_medians = filtered_data.groupby(
        ['site_id', pd.Grouper(key='datetime', freq='M')]).median().reset_index()
    num_plots = len(measures)
    num_rows = math.ceil(num_plots / 3)
    num_cols = min(3, max(1, num_plots))
    # squeeze=False keeps axs two-dimensional, so axs[row, col] always works
    fig, axs = plt.subplots(nrows=num_rows, ncols=num_cols,
                            figsize=(20, 5 * num_rows), squeeze=False)
    unique_sites = dataframe['site_id'].unique()
    line_objects = {}  # one legend entry per site
    for i, measure in enumerate(measures):
        row, col = i // 3, i % 3
        ax = axs[row, col]
        for site in unique_sites:
            site_data = monthly_medians[(monthly_medians['site_id'] == site)
                                        & (monthly_medians[measure].notna())]
            if not site_data.empty:
                line, = ax.plot(site_data['datetime'], site_data[measure], label=f'Site {site}')
                if site not in line_objects:
                    line_objects[site] = line
        ax.set_title(f'{measure}')
        ax.set_xlabel('Date')
        ax.set_ylabel('Median Value')
    for ax in fig.axes:
        plt.setp(ax.get_xticklabels(), rotation=45, ha='right')
    fig.suptitle(title)
    # Build legend labels from the sites actually plotted, so they stay aligned with the handles
    fig.legend(handles=line_objects.values(),
               labels=[f'Site {site}' for site in line_objects],
               loc='center right', bbox_to_anchor=(1.05, 0.5), ncol=1, title="Sites")
    plt.tight_layout()
    plt.subplots_adjust(right=0.9)
    plt.show()
plot_filtered_medians(df_merged, clay_measures, "Monthly Median Values for Clay Measures")
# moisture_measures
moisture_measures = [col for col in df_merged.columns if 'Moisture' in col]
plot_filtered_medians(df_merged, moisture_measures, "Monthly Median Values for Moisture Measures")
# salinity_measures
salinity_measures = [col for col in df_merged.columns if 'Salinity' in col]
plot_filtered_medians(df_merged, salinity_measures, "Monthly Median Values for Salinity Measures")
Soil data: reshape and interpolate using a pivot table¶
- interpolate(method='time') fills missing values based on linear interpolation over time.
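Time-based interpolation weights by elapsed time rather than row position; a minimal sketch with an uneven datetime index (toy values):

```python
import pandas as pd

# Gap of 1h then 2h: 14:00 sits one third of the way from 13:00 to 16:00
idx = pd.to_datetime(["2022-12-31 13:00", "2022-12-31 14:00", "2022-12-31 16:00"])
s = pd.Series([10.0, None, 16.0], index=idx)

# Linear in time: the 14:00 value becomes 10 + (16 - 10) / 3 = 12
print(s.interpolate(method="time"))
```

With the default `method='linear'` the missing value would instead fall halfway between its neighbours, regardless of the uneven spacing.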
# Pivot table - reshape data
pivoted_data = soil_co2_sensor_data.pivot_table(index='datetime', columns='probe_measure', values='soil_value', aggfunc='first')
pivoted_data.interpolate(method='time', inplace=True)
print(pivoted_data.head())
[Output truncated: pivoted_data.head() — rows indexed by datetime (2022-12-31 13:00:00 onward, roughly 2-hourly), one column per probe_measure (moisture, salinity, and temperature at each depth/soil type), with gaps filled by time interpolation]
0.98 21.81 2022-12-31 17:00:00 0.40 22.96 2022-12-31 19:00:00 1.89 27.24 2022-12-31 21:00:00 1.89 25.87 probe_measure Soil Temperature 20cm #0 Soil Temperature 30cm #0 \ datetime 2022-12-31 13:00:00 20.52 20.63 2022-12-31 15:00:00 20.90 20.69 2022-12-31 17:00:00 21.01 17.10 2022-12-31 19:00:00 25.47 20.04 2022-12-31 21:00:00 23.55 20.44 probe_measure Soil Temperature 40cm #0 Soil Temperature 50cm #0 \ datetime 2022-12-31 13:00:00 22.30 21.36 2022-12-31 15:00:00 20.52 17.77 2022-12-31 17:00:00 20.59 21.31 2022-12-31 19:00:00 22.65 20.62 2022-12-31 21:00:00 21.23 20.72 probe_measure Soil Temperature 60cm #0 Soil Temperature 70cm #0 \ datetime 2022-12-31 13:00:00 20.87 20.51 2022-12-31 15:00:00 19.56 18.13 2022-12-31 17:00:00 19.86 18.62 2022-12-31 19:00:00 19.86 19.56 2022-12-31 21:00:00 19.14 18.63 probe_measure Soil Temperature 80cm #0 datetime 2022-12-31 13:00:00 19.99 2022-12-31 15:00:00 18.18 2022-12-31 17:00:00 16.98 2022-12-31 19:00:00 19.22 2022-12-31 21:00:00 18.18
Soil Data Principal Component Analysis (PCA)¶
- Aim to find patterns in the dataset.
- Reduce dimensionality of the data to visualize the principal components that explain the most variance.
- Data is standardised: fit_transform() scales each feature to zero mean and unit variance; missing values are first replaced with 0s.
PCA Fitting¶
- n_components=2 reduces the dimensionality of the data to two principal components.
- pca.fit_transform(data_scaled) computes the first two principal components (reducing dimensionality while trying to preserve as much variance as possible).
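These two steps, standardisation and fitting, can be sketched end-to-end on synthetic data (the array X below is a hypothetical stand-in for pivoted_data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for pivoted_data: 5 samples x 3 features, with one missing value
X = np.array([[22.3, 0.07, 20.3],
              [28.9, 0.14, 21.8],
              [18.0, np.nan, 23.0],
              [21.6, 0.14, 27.2],
              [21.1, 0.09, 25.9]])

X_filled = np.nan_to_num(X)                          # missing values replaced with 0s
X_scaled = StandardScaler().fit_transform(X_filled)  # zero mean, unit variance per column

pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)             # shape (5, 2): two principal components
print(components.shape)
```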
Plot PCA¶
- Scatter plot to visualise the principal components (each point represents one datetime entry in the original dataset).
Overview¶
- PC1 captures the highest variance in the data.
- PC2 captures the next most significant relationship (independent of PC1).
- Each variable (a soil measurement at a given depth and soil type) contributes a loading to each component.
- components_ holds the loadings: each row in components_ corresponds to one principal component, and each column to one original variable.
Analyse¶
- Soil moisture measurements at different depths have varying influences on PC1 and PC2.
- "Soil Temperature 10cm #0" has a negative loading on PC1 (-0.103474) and a significant positive loading on PC2 (0.252541), suggesting that this variable moves inversely to PC1 while aligning strongly with PC2.
- "Soil Moisture 10cm #0 (Sand)" has a high positive loading on PC1 (0.188933), indicating a strong positive correlation with this principal component.
- Its loading on PC2, however, is minor (-0.020983), suggesting a smaller influence on that component.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Standardise Soil data
data_scaled = StandardScaler().fit_transform(pivoted_data.fillna(0))
# Fitting PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(data_scaled)
# Plotting the first two principal components
plt.figure(figsize=(8, 6))
plt.scatter(principal_components[:, 0], principal_components[:, 1], alpha=0.5)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Plot')
plt.grid(True)
plt.show()
# Get the loadings
loadings = pca.components_
# View the loadings as a DataFrame to see which loading corresponds to which original variable
import pandas as pd
loadings_df = pd.DataFrame(data=loadings, columns=pivoted_data.columns, index=['PC1', 'PC2'])
print(loadings_df)
Output (truncated): the loadings DataFrame, with rows PC1 and PC2 and one column per probe_measure (61 columns). Values cited in the analysis above include Soil Temperature 10cm #0 (PC1 -0.103474, PC2 0.252541) and Soil Moisture 10cm #0 (Sand) (PC1 0.188933, PC2 -0.020983).
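The sign interpretation used above can be checked on synthetic data: the sign of a variable's PC1 loading matches the sign of the correlation between that variable and the PC1 scores. A minimal sketch (all data synthetic):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=200)
noise = lambda: 0.1 * rng.normal(size=200)
# Two variables that rise together and one that moves inversely to them
X = np.column_stack([base + noise(), base + noise(), -base + noise()])
X = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
scores = pca.fit_transform(X)
loadings = pca.components_

# The sign of each PC1 loading agrees with the sign of the correlation
# between that variable and the PC1 scores
signs = [np.sign(np.corrcoef(X[:, j], scores[:, 0])[0, 1]) for j in range(3)]
print(signs, np.sign(loadings[0]))
```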
PCA Examination¶
- How much variance each principal component captures
- Scree Plot - importance of successive principal components
- Biplot - Both scores mapped
- Correlation Matrix
# PCA Examination
# Explained variance ratio
explained_variance = pca.explained_variance_ratio_
# How much variance each principal component captures
plt.figure(figsize=(6, 4))
plt.bar(range(1, len(explained_variance)+1), explained_variance, alpha=0.7, color=color_d[3], label='Individual explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.title('Explained Variance Plot')
plt.legend()
plt.show()
# Scree Plot - importance of successive principal components
plt.figure(figsize=(6, 4))
plt.plot(range(1, len(explained_variance)+1), explained_variance, 'o-', color=color_d[3])
plt.title('Scree Plot')
plt.xlabel('Principal Components')
plt.ylabel('Explained Variance Ratio')
plt.grid(True)
plt.show()
# Biplot - Both scores mapped
plt.figure(figsize=(8, 6))
plt.scatter(principal_components[:, 0], principal_components[:, 1], alpha=0.5)
for i, v in enumerate(pivoted_data.columns):
    plt.arrow(0, 0, pca.components_[0, i]*max(principal_components[:, 0]),
              pca.components_[1, i]*max(principal_components[:, 1]), color=color_d[3], alpha=0.5)
    plt.text(pca.components_[0, i]*max(principal_components[:, 0])*1.2,
             pca.components_[1, i]*max(principal_components[:, 1])*1.2, v, color=color_d[0], ha='center', va='center', fontsize=8)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show()
# Correlation matrix
corr_matrix = pivoted_data.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', linewidths=.5)
plt.title('Correlation Matrix Heatmap')
plt.show()
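Beyond the per-component ratios plotted above, the cumulative explained variance is often used to choose how many components to keep. A minimal sketch on synthetic data (the 90% threshold is an illustrative choice):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 10))
X[:, 0] *= 5  # one direction carries much more variance than the rest

pca = PCA().fit(X)  # keep all 10 components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components explaining at least 90% of the variance
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)
print(n_keep)
```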
Examine Median Values Soil probe measures¶
# Find the median of each type of probe measure
median_values = pivoted_data.median()
# Median values correlation analysis
median_df = pd.DataFrame(median_values, columns=['median_value'])
median_df['probe_measure'] = median_df.index
print("Median values by probe_measure:")
print(median_df)
Median values by probe_measure:
median_value probe_measure
probe_measure
#0 Soil Moisture 10cm 26.969500 #0 Soil Moisture 10cm
#0 Soil Moisture 10cm Clay Loam 26.150000 #0 Soil Moisture 10cm Clay Loam
#0 Soil Moisture 20cm 27.930000 #0 Soil Moisture 20cm
#0 Soil Moisture 30cm 24.108267 #0 Soil Moisture 30cm
#0 Soil Moisture 40cm 25.250000 #0 Soil Moisture 40cm
... ... ...
Soil Temperature 40cm #0 18.690000 Soil Temperature 40cm #0
Soil Temperature 50cm #0 18.840000 Soil Temperature 50cm #0
Soil Temperature 60cm #0 18.736201 Soil Temperature 60cm #0
Soil Temperature 70cm #0 18.710000 Soil Temperature 70cm #0
Soil Temperature 80cm #0 18.654004 Soil Temperature 80cm #0
[61 rows x 2 columns]
Correlation Between Median Values (soil values)¶
# Convert median_df
median_df = pd.DataFrame(median_values, columns=['median_value'])
median_df['probe_measure'] = median_df.index
# Plot
plt.figure(figsize=(14, 7))
sns.barplot(x='probe_measure', y='median_value', data=median_df.reset_index(drop=True), color=color_d[3])
plt.xticks(rotation=90)
plt.xlabel('Probe Measure')
plt.ylabel('Median Soil Value')
plt.title('Comparison of Median Soil Values by Probe Type')
plt.show()
User Interface - View Soil depth, Moisture, Salinity and Temperature¶
Soil Stats¶
Filter based on¶
- Soil type
- Soil depth
- Measurement type (such as moisture, temperature, salinity).
For each soil type (Sandy Loam, Clay, Clayey Loam, and Sand)¶
- Soil Moisture: measures the amount of water held in the soil, which can impact soil chemistry, plant growth, and microorganism activity.
- Soil Salinity: indicates the salt content in the soil, which affects the soil's ability to hold water and nutrients. High salinity can be detrimental to most plants.
- Soil Temperature: reflects the heat within the soil, which influences plant root growth and microbial activity.
pivoted_data.head()
| probe_measure | #0 Soil Moisture 10cm | #0 Soil Moisture 10cm Clay Loam | #0 Soil Moisture 20cm | #0 Soil Moisture 30cm | #0 Soil Moisture 40cm | #0 Soil Moisture 50cm | #0 Soil Moisture 60cm | #0 Soil Moisture 70cm | #0 Soil Moisture 80cm | Soil Moisture 10cm #0 | Soil Moisture 10cm #0 (Clayey Loam) | Soil Moisture 10cm #0 (Sand) | Soil Moisture 10cm #0 (Sandy Loam) | Soil Moisture 20cm #0 | Soil Moisture 20cm #0 (Clayey Loam) | Soil Moisture 20cm #0 (Sand) | Soil Moisture 20cm #0 (Sandy Loam) | Soil Moisture 30cm #0 | Soil Moisture 30cm #0 (Clay) | Soil Moisture 30cm #0 (Clayey Loam) | Soil Moisture 30cm #0 (Sand) | Soil Moisture 30cm #0 (Sandy Loam) | Soil Moisture 40cm #0 | Soil Moisture 40cm #0 (Clay) | Soil Moisture 40cm #0 (Clayey Loam) | Soil Moisture 40cm #0 (Loam) | Soil Moisture 40cm #0 (Sand) | Soil Moisture 40cm #0 (Sandy Loam) | Soil Moisture 50cm #0 | Soil Moisture 50cm #0 (Clay) | Soil Moisture 50cm #0 (Loam) | Soil Moisture 50cm #0 (Sandy Loam) | Soil Moisture 60cm #0 | Soil Moisture 60cm #0 (Clay) | Soil Moisture 60cm #0 (Loam) | Soil Moisture 60cm #0 (Sandy Loam) | Soil Moisture 70cm #0 | Soil Moisture 70cm #0 (Clay) | Soil Moisture 70cm #0 (Loam) | Soil Moisture 70cm #0 (Sandy Loam) | Soil Moisture 80cm #0 | Soil Moisture 80cm #0 (Clay) | Soil Moisture 80cm #0 (Loam) | Soil Moisture 80cm #0 (Sandy Loam) | Soil Salinity 10cm #0 | Soil Salinity 20cm #0 | Soil Salinity 30cm #0 | Soil Salinity 40 cm #0 | Soil Salinity 40cm #0 | Soil Salinity 50cm #0 | Soil Salinity 60cm #0 | Soil Salinity 70cm #0 | Soil Salinity 80cm #0 | Soil Temperature 10cm #0 | Soil Temperature 20cm #0 | Soil Temperature 30cm #0 | Soil Temperature 40cm #0 | Soil Temperature 50cm #0 | Soil Temperature 60cm #0 | Soil Temperature 70cm #0 | Soil Temperature 80cm #0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| datetime | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2022-12-31 13:00:00 | 22.28 | 26.680000 | 28.21 | 15.54 | 34.67 | 30.98 | 34.14 | 51.67 | 35.38 | 19.77 | 27.62 | 16.870000 | 19.46 | 26.18 | 31.35 | 18.650000 | 22.11 | 24.15 | 39.02 | 37.230000 | 21.650000 | 21.950000 | 46.97 | 38.58 | 31.140000 | 29.520000 | 13.080000 | 27.230000 | 32.63 | 39.79 | 30.78 | 29.640000 | 49.62 | 37.74 | 27.510000 | 30.180000 | 16.25 | 39.00 | 28.990000 | 30.18 | 54.65 | 28.46 | 29.050000 | 29.67 | 0.07 | 0.05 | 0.39 | 1.60 | 0.02 | 0.06 | 0.79 | 0.28 | 0.31 | 20.26 | 20.52 | 20.63 | 22.30 | 21.36 | 20.87 | 20.51 | 19.99 |
| 2022-12-31 15:00:00 | 28.94 | 26.640000 | 28.20 | 30.04 | 34.65 | 47.13 | 41.87 | 32.78 | 35.57 | 20.07 | 22.92 | 16.890000 | 20.77 | 31.15 | 30.29 | 18.630000 | 25.59 | 27.59 | 32.08 | 26.280000 | 19.540000 | 25.150000 | 12.30 | 31.34 | 37.110000 | 31.850000 | 13.080000 | 27.230000 | 34.08 | 41.30 | 33.14 | 29.630000 | 41.87 | 36.77 | 30.290000 | 30.180000 | 58.48 | 34.26 | 31.590000 | 30.18 | 37.51 | 28.46 | 31.490000 | 29.67 | 0.14 | 0.11 | 0.45 | 1.46 | 0.25 | 0.25 | 0.36 | 0.84 | 0.98 | 21.81 | 20.90 | 20.69 | 20.52 | 17.77 | 19.56 | 18.13 | 18.18 |
| 2022-12-31 17:00:00 | 17.98 | 26.690000 | 26.11 | 17.92 | 34.57 | 42.47 | 42.39 | 54.53 | 55.72 | 25.13 | 23.04 | 14.640000 | 30.66 | 16.07 | 28.75 | 15.680000 | 18.00 | 30.76 | 31.79 | 30.460000 | 21.600000 | 28.410000 | 27.97 | 31.28 | 31.030000 | 30.410000 | 13.070000 | 27.220000 | 38.16 | 33.97 | 32.16 | 29.620000 | 16.78 | 33.38 | 28.740000 | 30.180000 | 64.63 | 34.21 | 30.760000 | 30.18 | 30.18 | 34.90 | 31.350000 | 29.67 | 0.10 | 0.31 | 0.07 | 0.07 | 1.34 | 0.82 | 0.28 | 1.04 | 0.40 | 22.96 | 21.01 | 17.10 | 20.59 | 21.31 | 19.86 | 18.62 | 16.98 |
| 2022-12-31 19:00:00 | 21.56 | 26.654286 | 26.08 | 17.90 | 20.48 | 30.57 | 41.54 | 50.76 | 55.55 | 19.49 | 23.05 | 16.760000 | 19.79 | 26.08 | 26.02 | 18.590000 | 21.57 | 27.57 | 32.91 | 30.452857 | 19.530000 | 28.118571 | 40.52 | 35.35 | 31.898571 | 30.315714 | 13.100000 | 27.214286 | 30.57 | 36.90 | 32.11 | 29.615714 | 55.02 | 40.70 | 28.682857 | 30.181429 | 64.34 | 42.80 | 30.695714 | 30.18 | 64.16 | 42.76 | 31.265714 | 29.67 | 0.14 | 0.36 | 0.32 | 1.45 | 0.55 | 0.82 | 1.45 | 1.43 | 1.89 | 27.24 | 25.47 | 20.04 | 22.65 | 20.62 | 19.86 | 19.56 | 19.22 |
| 2022-12-31 21:00:00 | 21.05 | 26.618571 | 26.06 | 17.94 | 20.43 | 30.56 | 41.48 | 50.56 | 55.43 | 20.09 | 23.02 | 16.308333 | 19.65 | 26.06 | 26.01 | 18.466667 | 21.60 | 30.83 | 32.07 | 30.445714 | 19.868333 | 27.827143 | 36.83 | 35.32 | 32.767143 | 30.221429 | 13.092857 | 27.208571 | 41.40 | 36.89 | 32.06 | 29.611429 | 54.75 | 40.64 | 28.625714 | 30.182857 | 50.56 | 42.77 | 30.631429 | 30.18 | 64.09 | 42.75 | 31.181429 | 29.67 | 0.09 | 0.20 | 0.33 | 1.46 | 0.56 | 0.82 | 1.45 | 1.43 | 1.89 | 25.87 | 23.55 | 20.44 | 21.23 | 20.72 | 19.14 | 18.63 | 18.18 |
# User interface to view Soil data
import re
import ipywidgets as widgets
from IPython.display import display
pivoted_data_index = pivoted_data.copy()
# Filter columns
def extract_info_from_columns(df):
    types = set()
    depths = set()
    measurements = set()
    col_pattern = r"(Soil \w+) (\d+cm) #\d+( \(([\w ]+)\))?"
    for col in df.columns:
        match = re.match(col_pattern, col)
        if match:
            measurement, depth, _, soil_type = match.groups()
            measurements.add(measurement)
            depths.add(depth)
            if soil_type:
                types.add(soil_type)
            else:
                types.add('General')
    return sorted(types), sorted(depths), sorted(measurements)
types, depths, measurements = extract_info_from_columns(pivoted_data_index)
# Filter data
def filter_data(df, soil_type, depth, measurement):
    regex = rf"{measurement} {depth} #\d+"
    if soil_type != 'General':
        regex += rf" \({soil_type}\)"
    filtered_df = df.filter(regex=regex)
    return filtered_df
# Drop down
type_selector = widgets.Dropdown(options=types, description='Soil Type:')
depth_selector = widgets.Dropdown(options=depths, description='Depth:')
measurement_selector = widgets.Dropdown(options=measurements, description='Measurement:')
output = widgets.Output()
# Update tables
def update_output(change):
    with output:
        output.clear_output()
        filtered_df = filter_data(pivoted_data_index, type_selector.value, depth_selector.value, measurement_selector.value)
        print(filtered_df)
type_selector.observe(update_output, names='value')
depth_selector.observe(update_output, names='value')
measurement_selector.observe(update_output, names='value')
display(type_selector, depth_selector, measurement_selector, output)
Dropdown(description='Soil Type:', options=('Clay', 'Clayey Loam', 'General', 'Loam', 'Sand', 'Sandy Loam'), v…
Dropdown(description='Depth:', options=('10cm', '20cm', '30cm', '40cm', '50cm', '60cm', '70cm', '80cm'), value…
Dropdown(description='Measurement:', options=('Soil Moisture', 'Soil Salinity', 'Soil Temperature'), value='So…
Output()
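The column-name pattern used by extract_info_from_columns can be exercised on a few sample names (the samples below are illustrative of the dataset's naming scheme):

```python
import re

# The column-name pattern used by extract_info_from_columns above
col_pattern = r"(Soil \w+) (\d+cm) #\d+( \(([\w ]+)\))?"

samples = [
    "Soil Moisture 10cm #0 (Sandy Loam)",
    "Soil Temperature 80cm #0",
    "Soil Salinity 20cm #0 (Clay)",
]
parsed = []
for col in samples:
    measurement, depth, _, soil_type = re.match(col_pattern, col).groups()
    parsed.append((measurement, depth, soil_type or "General"))
print(parsed)
```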
Soil VWC: Volumetric water content - Stats, Time Series¶
# Soil VWC: Volumetric water content
# Regular expressions (regex) select columns whose names match a pattern such as "Soil Moisture"
vwc_depths = pivoted_data.filter(regex=r'Soil Moisture \d+cm').mean(axis=1)
sandy_loam_vwc = pivoted_data.filter(regex=r'Soil Moisture.*Sandy Loam').mean(axis=1)
average_vwc = pivoted_data.filter(regex='Soil Moisture').mean(axis=1)
# Average soil vwc depth
vwc_cols = [col for col in pivoted_data.columns if "Soil Moisture" in col and not any(x in col for x in ['Clay', 'Loam', 'Sand'])]
average_vwc_by_depth = pivoted_data[vwc_cols].mean()
plt.figure(figsize=(14, 7))
average_vwc_by_depth.plot(kind='bar', color=color_d[3])
plt.title('Average Soil VWC: Volumetric Water content by Depth')
plt.xlabel('Depth')
plt.ylabel('Average VWC (%)')
plt.show()
# Print stats
print("Statistics of VWC: Volumetric Water content Data:")
print(pivoted_data[vwc_cols].describe())
# Graph
plt.figure(figsize=(20, 20))
for i, col in enumerate(vwc_cols):
    plt.subplot(4, 4, i+1)
    plt.plot(pivoted_data[col], label=col)
    plt.title(f'Time Series of {col}')
    plt.xlabel('Date')
    plt.ylabel('VWC: Volumetric Water Content (%)')
    plt.legend()
    plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
average_vwc_by_depth.describe()
Statistics of VWC: Volumetric Water content Data (output truncated): describe() statistics for each of the 16 general soil moisture columns (count 3665 each), with per-column means ranging from about 25.3 to 48.7 %VWC, followed by describe() of the per-depth averages (count 16, mean 38.48, min 25.34, max 48.66).
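The two column-selection styles used in this section, DataFrame.filter(regex=...) and a list comprehension that excludes soil-type suffixes, can be sketched on a tiny frame (column names hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "Soil Moisture 10cm #0": [22.3, 28.9],
    "Soil Moisture 10cm #0 (Sand)": [16.8, 16.9],
    "Soil Temperature 10cm #0": [20.3, 21.8],
})

# Regex selection: all moisture columns at any depth
moisture = df.filter(regex=r"Soil Moisture \d+cm")

# List-comprehension selection: moisture columns without a soil-type suffix
general = [c for c in df.columns
           if "Soil Moisture" in c and not any(x in c for x in ["Clay", "Loam", "Sand"])]

print(moisture.columns.tolist())
print(general)
```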
Soil Temperature - Stats, Time Series¶
# Calculate and print average soil temperature by depth, soil type
temperature_depths = pivoted_data.filter(regex=r'Soil Temperature \d+cm').mean(axis=1)
sandy_loam_temperature = pivoted_data.filter(regex=r'Soil Temperature.*Sandy Loam').mean(axis=1)
average_temperature = pivoted_data.filter(regex='Soil Temperature').mean(axis=1)
temperature_cols = [col for col in pivoted_data.columns if "Soil Temperature" in col and not any(x in col for x in ['Clay', 'Loam', 'Sand'])]
average_temperature_by_depth = pivoted_data[temperature_cols].mean()
plt.figure(figsize=(14, 7))
average_temperature_by_depth.plot(kind='bar')
plt.title('Average Soil Temperature by Depth')
plt.xlabel('Depth')
plt.ylabel('Average Temperature (°C)')
plt.show()
print("Statistics of Temperature Data:")
print(pivoted_data[temperature_cols].describe())
# Time Series of Soil Temperature
plt.figure(figsize=(20, 20))
for i, col in enumerate(temperature_cols):
    plt.subplot(4, 4, i+1)
    plt.plot(pivoted_data[col], label=col)
    plt.title(f'Time Series of {col}')
    plt.xlabel('Date')
    plt.ylabel('Temperature (°C)')
    plt.legend()
    plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
average_temperature_by_depth.describe()
Statistics of Temperature Data (output truncated): describe() statistics for the 8 general soil temperature columns (count 3665 each), with per-column means between about 17.8 and 18.6 °C, followed by describe() of the per-depth averages (count 8, mean 18.19, min 17.80, max 18.61).
Soil Salinity - Stats, Time Series¶
# Soil Salinity
salinity_depths = pivoted_data.filter(regex=r'Soil Salinity \d+cm').mean(axis=1)
sandy_loam_salinity = pivoted_data.filter(regex=r'Soil Salinity.*Sandy Loam').mean(axis=1)
average_salinity = pivoted_data.filter(regex='Soil Salinity').mean(axis=1)
salinity_cols = [col for col in pivoted_data.columns if "Soil Salinity" in col and not any(x in col for x in ['Clay', 'Loam', 'Sand'])]
average_salinity_by_depth = pivoted_data[salinity_cols].mean()
plt.figure(figsize=(14, 7))
average_salinity_by_depth.plot(kind='bar')
plt.title('Average Soil Salinity by Depth')
plt.xlabel('Depth')
plt.ylabel('Average Salinity (µS/cm)')
plt.show()
print("Statistics of Salinity Data:")
print(pivoted_data[salinity_cols].describe())
# Time Series of Soil Salinity
plt.figure(figsize=(20, 20))
for i, col in enumerate(salinity_cols):
    plt.subplot(4, 4, i+1)
    plt.plot(pivoted_data[col], label=col)
    plt.title(f'Time Series of {col}')
    plt.xlabel('Date')
    plt.ylabel('Salinity (µS/cm)')
    plt.legend()
    plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
average_salinity_by_depth.describe()
Statistics of Salinity Data (output truncated): describe() statistics for the 9 general soil salinity columns (count 3665 each), with per-column means between about 0.21 and 1.17 µS/cm, followed by describe() of the per-depth averages (count 9, mean 0.65, min 0.21, max 1.17).
MicroLab and Soil Data Pre-Processing & Normalisation¶
- temperature, humidity, light, co2, soil_values
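The min-max normalisation applied later in this section can be sketched on a toy frame (the values below are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical readings for two of the features listed above
df = pd.DataFrame({"temperature": [18.0, 24.5, 31.0],
                   "co2": [400.0, 416.0, 600.0]})

scaler = MinMaxScaler()
df[["temperature", "co2"]] = scaler.fit_transform(df[["temperature", "co2"]])

# Each feature is rescaled to [0, 1]; the original units are
# recoverable via scaler.inverse_transform()
print(df)
```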
# Load the DataFrames from CSV files
microclimate_sensor_cleaned = pd.read_csv('/content/drive/My Drive/sit378_eco_metrix/microclimate_sensor_cleaned.csv')
microlabs_co2_sensor_data_cleaned = pd.read_csv('/content/drive/My Drive/sit378_eco_metrix/microlabs_co2_sensor_data_cleaned.csv')
df_soil_sensor_data_cleaned = pd.read_csv('/content/drive/My Drive/sit378_eco_metrix/df_soil_sensor_data_cleaned.csv')
soil_co2_sensor_data.head()
| datetime | site_id | id | probe_measure | soil_value | unit | latitude_x | longitude_x | location | time_x | date_x | dev_id | time_y | temperature | humidity | light | motion | visit | vdd | lat_long | co2 | latitude_y | longitude_y | date_y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-12-31 13:00:00 | 64979 | 15522826 | Soil Moisture 50cm #0 | 32.63 | %VWC | -37.80170 | 144.97138 | -37.801699999983285, 144.9713799999935 | 13:00:00 | 2022-12-31 | ers-co2-9da0 | 13:00:46 | 24.5 | 61 | 0.0 | 0 | 0 | 3656 | -37.8133183, 144.9668853 | 416.0 | -37.813318 | 144.966885 | 2022-12-31 |
| 1 | 2022-12-31 13:00:00 | 66195 | 15522987 | Soil Temperature 50cm #0 | 21.36 | ºC | -37.82073 | 144.97254 | -37.82072999998553, 144.97253999999268 | 13:00:00 | 2022-12-31 | ers-co2-9da0 | 13:00:46 | 24.5 | 61 | 0.0 | 0 | 0 | 3656 | -37.8133183, 144.9668853 | 416.0 | -37.813318 | 144.966885 | 2022-12-31 |
| 2 | 2022-12-31 13:00:00 | 64976 | 15515162 | Soil Salinity 20cm #0 | 0.05 | µS/cm | -37.78292 | 144.95566 | -37.78291999998904, 144.95566000000628 | 13:00:00 | 2022-12-31 | ers-co2-9da0 | 13:00:46 | 24.5 | 61 | 0.0 | 0 | 0 | 3656 | -37.8133183, 144.9668853 | 416.0 | -37.813318 | 144.966885 | 2022-12-31 |
| 3 | 2022-12-31 13:00:00 | 64986 | 15521623 | Soil Moisture 50cm #0 | 51.39 | %VWC | -37.82921 | 144.97348 | -37.8292099999859, 144.97347999998564 | 13:00:00 | 2022-12-31 | ers-co2-9da0 | 13:00:46 | 24.5 | 61 | 0.0 | 0 | 0 | 3656 | -37.8133183, 144.9668853 | 416.0 | -37.813318 | 144.966885 | 2022-12-31 |
| 4 | 2022-12-31 13:00:00 | 64979 | 15514301 | Soil Temperature 30cm #0 | 20.63 | ºC | -37.80170 | 144.97138 | -37.801699999983285, 144.9713799999935 | 13:00:00 | 2022-12-31 | ers-co2-9da0 | 13:00:46 | 24.5 | 61 | 0.0 | 0 | 0 | 3656 | -37.8133183, 144.9668853 | 416.0 | -37.813318 | 144.966885 | 2022-12-31 |
# Date Time
soil_co2_sensor_data['date_x'] = pd.to_datetime(soil_co2_sensor_data['date_x']).dt.date
soil_co2_sensor_data['datetime'] = pd.to_datetime(soil_co2_sensor_data['date_x'].astype(str) + ' ' + soil_co2_sensor_data['time_x'].astype(str))
# Normalise
from sklearn.preprocessing import MinMaxScaler
soil_co2_sensor_data_norm = soil_co2_sensor_data.copy()
soil_co2_sensor_data_norm['date_x'] = pd.to_datetime(soil_co2_sensor_data_norm['date_x']).dt.date
soil_co2_sensor_data_norm['datetime'] = pd.to_datetime(soil_co2_sensor_data_norm['date_x'].astype(str) + ' ' + soil_co2_sensor_data_norm['time_x'].astype(str))
features_to_normalize = ['temperature', 'humidity', 'light', 'co2', 'soil_value']
scaler = MinMaxScaler()
soil_co2_sensor_data_norm[features_to_normalize] = scaler.fit_transform(soil_co2_sensor_data_norm[features_to_normalize])
print(soil_co2_sensor_data_norm.head())
datetime site_id id probe_measure \
0 2022-12-31 13:00:00 64979 15522826 Soil Moisture 50cm #0
1 2022-12-31 13:00:00 66195 15522987 Soil Temperature 50cm #0
2 2022-12-31 13:00:00 64976 15515162 Soil Salinity 20cm #0
3 2022-12-31 13:00:00 64986 15521623 Soil Moisture 50cm #0
4 2022-12-31 13:00:00 64979 15514301 Soil Temperature 30cm #0
soil_value unit latitude_x longitude_x \
0 0.310851 %VWC -37.80170 144.97138
1 0.203487 ºC -37.82073 144.97254
2 0.000476 µS/cm -37.78292 144.95566
3 0.489568 %VWC -37.82921 144.97348
4 0.196532 ºC -37.80170 144.97138
location time_x date_x dev_id \
0 -37.801699999983285, 144.9713799999935 13:00:00 2022-12-31 ers-co2-9da0
1 -37.82072999998553, 144.97253999999268 13:00:00 2022-12-31 ers-co2-9da0
2 -37.78291999998904, 144.95566000000628 13:00:00 2022-12-31 ers-co2-9da0
3 -37.8292099999859, 144.97347999998564 13:00:00 2022-12-31 ers-co2-9da0
4 -37.801699999983285, 144.9713799999935 13:00:00 2022-12-31 ers-co2-9da0
time_y temperature humidity light motion visit vdd \
0 13:00:46 0.633333 0.545455 0.0 0 0 3656
1 13:00:46 0.633333 0.545455 0.0 0 0 3656
2 13:00:46 0.633333 0.545455 0.0 0 0 3656
3 13:00:46 0.633333 0.545455 0.0 0 0 3656
4 13:00:46 0.633333 0.545455 0.0 0 0 3656
lat_long co2 latitude_y longitude_y date_y
0 -37.8133183, 144.9668853 0.019147 -37.813318 144.966885 2022-12-31
1 -37.8133183, 144.9668853 0.019147 -37.813318 144.966885 2022-12-31
2 -37.8133183, 144.9668853 0.019147 -37.813318 144.966885 2022-12-31
3 -37.8133183, 144.9668853 0.019147 -37.813318 144.966885 2022-12-31
4 -37.8133183, 144.9668853 0.019147 -37.813318 144.966885 2022-12-31
# # drop nearest_co2_index
# soil_co2_sensor_data_norm = soil_co2_sensor_data_norm.drop('nearest_co2_index', axis=1)
Temperature, CO2 and Soil Value Level over Time¶
# Plot Temp, Co2 and Soil Value Time Series
plt.figure(figsize=(14, 6))
plt.plot(soil_co2_sensor_data_norm['temperature'], label='Temperature', color=color_d[0], alpha=0.7)
plt.plot(soil_co2_sensor_data_norm['co2'], label='CO2 Level', color=color_d[3], alpha=0.7)
plt.plot(soil_co2_sensor_data_norm['soil_value'], label='Soil Level', color=color_d[1], alpha=0.4)
plt.axhline(soil_co2_sensor_data_norm['temperature'].mean(), color='orange', linestyle='--', label='Avg Temperature')
plt.axhline(soil_co2_sensor_data_norm['co2'].mean(), color='red', linestyle='--', label='Avg CO2 Level')
plt.axhline(soil_co2_sensor_data_norm['soil_value'].mean(), color='purple', linestyle='--', label='Avg Soil Value Level')
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.title('Temperature, CO2 and Soil Value Level over Time')
plt.show()
Daily Means of Temperature, Humidity, Light, CO2, and soil_value¶
# Find daily means for the numeric columns only (group by calendar date, not the full timestamp)
daily_means = soil_co2_sensor_data_norm.groupby(soil_co2_sensor_data_norm['datetime'].dt.date)[features_to_normalize].mean()
plt.figure(figsize=(15, 7))
plt.plot(daily_means.index, daily_means['temperature'], label='Temperature', color=color_d[0])
plt.plot(daily_means.index, daily_means['humidity'], label='Humidity', color=color_d[3])
plt.plot(daily_means.index, daily_means['light'], label='Light', color=color_l[3])
plt.plot(daily_means.index, daily_means['co2'], label='CO2', color=color_l[1])
plt.plot(daily_means.index, daily_means['soil_value'], label='soil_value', color=color_d[2])
plt.title('Daily Means of Temperature, Humidity, Light, CO2, and soil_value')
plt.xlabel('Date')
plt.ylabel('Normalized Value')
plt.legend()
plt.grid(True)
plt.show()
color_original = color_d[2]
color_locf = color_d[1]
Correlation Analysis: How Different Variables Relate to Each Other¶
# Correlation analysis: scatter matrix showing how variables relate to each other
pd.plotting.scatter_matrix(soil_co2_sensor_data_norm[['temperature', 'humidity', 'light', 'co2', 'soil_value']], color=color_d[3], figsize=(10, 10))
plt.show()
Correlation Heat Map¶
- ['soil_value', 'temperature', 'humidity', 'light', 'motion', 'vdd', 'co2']
- Ref https://python-graph-gallery.com/92-control-color-in-seaborn-heatmaps/
# Heat Map for Correlation
columns_to_include = ['soil_value', 'temperature', 'humidity', 'light', 'motion', 'vdd', 'co2']
numeric_data = soil_co2_sensor_data_norm[columns_to_include]
correlation_matrix = numeric_data.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='Greens', fmt=".2f")
plt.title('Correlation matrix of selected data in soil_co2_sensor_data_norm')
plt.show()
Correlations between variables - Scatter Plot¶
- ['soil_value', 'temperature', 'humidity', 'light', 'motion', 'vdd', 'co2']
# check correlations between variables
variables_to_check = ['soil_value', 'temperature', 'humidity', 'light', 'motion', 'vdd', 'co2']
# Scatter plots
for i in range(len(variables_to_check)):
    for j in range(i + 1, len(variables_to_check)):
        # Check correlation coefficient
        corr_coeff = numeric_data[variables_to_check[i]].corr(numeric_data[variables_to_check[j]])
        if abs(corr_coeff) > 0.5:
            sns.scatterplot(data=numeric_data, x=variables_to_check[i], y=variables_to_check[j])
            plt.title(f'Scatter plot of {variables_to_check[i]} vs {variables_to_check[j]} (Correlation: {corr_coeff:.2f})')
            plt.show()
from scipy.stats import pearsonr, spearmanr
# Pearson correlation test
stat, p = pearsonr(numeric_data['temperature'], numeric_data['co2'])
print(f'Pearson correlation between temperature and CO2: Stat={stat:.3f}, p={p:.3f}')
# Spearman correlation test
stat, p = spearmanr(numeric_data['temperature'], numeric_data['co2'])
print(f'Spearman correlation between temperature and CO2: Stat={stat:.3f}, p={p:.3f}')
Pearson correlation between temperature and CO2: Stat=0.030, p=0.000
Spearman correlation between temperature and CO2: Stat=0.158, p=0.000
Reflection: Pearson & Spearman Correlation¶
Both Pearson and Spearman indicate a statistically significant, but weak, relationship between temperature and CO2 levels.
Changes in temperature show some association with changes in CO2 levels; many other factors are likely influencing CO2 levels as well.
The impact of temperature on CO2 levels is small, but consistent enough to be statistically significant.
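The distinction between the two tests can be illustrated with a small synthetic sketch (the data here is made up for illustration, and the rank-based Spearman is computed by hand rather than via `scipy`): Spearman reaches 1.0 for any strictly increasing relationship, while Pearson only reaches 1.0 when the relationship is linear.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 5, 200))
y = np.exp(x)  # monotonic but strongly non-linear

def pearson(a, b):
    # centred dot-product form of the Pearson correlation coefficient
    a, b = a - a.mean(), b - b.mean()
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

def spearman(a, b):
    # rank-transform both series, then apply Pearson to the ranks
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(a), rank(b))

print(f'Pearson:  {pearson(x, y):.3f}')
print(f'Spearman: {spearman(x, y):.3f}')
```

For this monotonic-but-curved relationship, Spearman is (numerically) 1.0 while Pearson is noticeably lower, which is why reporting both, as above, is informative.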
Granger Causality test (Prepare for ML Modelling)¶
- Rows = 'caused' variables
- Columns = 'causing' variables.
- A low p-value (typically < 0.05) suggests that the 'causing' variable can predict future values of the 'caused' variable, indicating Granger causality.
- A high p-value (≥0.05) suggests that there is not enough evidence to reject the null hypothesis, implying that the "causing" variable does not have a predictive effect on the "caused" variable.
from statsmodels.tsa.stattools import grangercausalitytests
# dataset converted to a time series format, indexed by datetime
soil_co2_sensor_data_norm['datetime'] = pd.to_datetime(soil_co2_sensor_data_norm['datetime'])
soil_co2_sensor_data_norm.set_index('datetime', inplace=True)
# Sort the data by datetime just in case it's not sorted
soil_co2_sensor_data_norm.sort_index(inplace=True)
# filter values
data_for_test = soil_co2_sensor_data_norm[['soil_value', 'temperature', 'humidity', 'light', 'motion', 'vdd', 'co2']]
# Granger Causality test
max_lags = 4 # Number of lags
results = pd.DataFrame(index=data_for_test.columns, columns=data_for_test.columns, dtype=float)
for col in data_for_test.columns:
    for row in data_for_test.columns:
        if col != row:  # skip the variable itself
            test_result = grangercausalitytests(data_for_test[[row, col]], maxlag=max_lags, verbose=False)
            # Smallest p-value across all lags
            p_values = [round(test_result[i + 1][0]['ssr_chi2test'][1], 4) for i in range(max_lags)]
            min_p_value = min(p_values)
            results.loc[row, col] = min_p_value
print(results)
             soil_value  temperature  humidity   light  motion     vdd     co2
soil_value          NaN       0.0000    0.0000  0.0000  0.0000  0.0000  0.0000
temperature      0.0007          NaN    0.0000  0.0016  0.7412  0.0000  0.1245
humidity         0.0747       0.0000       NaN  0.0002  0.8970  0.0187  0.4236
light            0.3656       0.0000    0.0035     NaN  0.5778  0.0000  0.2824
motion           0.0158       0.4526    0.6044  0.6395     NaN  0.2706  0.0000
vdd              0.5401       0.0000    0.8724  0.0000  0.5125     NaN  0.2764
co2              0.0771       0.0003    0.7894  0.2816  0.0000  0.0034     NaN
Review Granger Causality test¶
Possible Predictors¶
- Temperature predicts soil_value (p=0.0007), humidity (p=0.0000), light (p=0.0016), and vdd (p=0.0000) but not co2 (p=0.1245).
- CO2 significantly predicts motion (p=0.0000) and vdd (p=0.0034), but has no significance to humidity (p=0.7894) and light (p=0.2816).
- Humidity, Light, Motion, and VDD: mixed results in predicting other variables, where some are significant predictors while others are not.
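Rather than reading the p-value matrix by eye, the significant 'causing' to 'caused' pairs can be extracted programmatically. This is a minimal sketch using a hypothetical 2x2 p-value matrix (not the full results above), following the same row/column convention:

```python
import numpy as np
import pandas as pd

# Hypothetical p-value matrix: rows = 'caused', columns = 'causing'
results = pd.DataFrame(
    [[np.nan, 0.40],
     [0.01, np.nan]],
    index=['temperature', 'co2'],
    columns=['temperature', 'co2'],
)

# Collect every (causing, caused, p) pair below the 0.05 threshold
significant = [
    (causing, caused, results.loc[caused, causing])
    for caused in results.index
    for causing in results.columns
    if causing != caused and results.loc[caused, causing] < 0.05
]
print(significant)
```

Applied to the real `results` frame above, the same comprehension would list all Granger-significant predictor pairs in one pass.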
Haversine Formula: Calculate Distance Between Sensor Locations¶
- ~1.4 km between the microclimate sensor and the Microlabs CO2 sensor
# Map the latitude and longitude from both datasets
import folium
coords_microclimate = [-37.800575, 144.964635]
coords_norm_microlabs = [-37.8133183, 144.9668853]
# Create a map centered around the average of the given coordinates
map_center = [(coords_microclimate[0] + coords_norm_microlabs[0]) / 2,
(coords_microclimate[1] + coords_norm_microlabs[1]) / 2]
map = folium.Map(location=map_center, zoom_start=14)
# Add markers for each dataset's location
folium.Marker(coords_microclimate, popup='Microclimate Sensor', tooltip='Microclimate Sensor').add_to(map)
folium.Marker(coords_norm_microlabs, popup='Norm Microlabs CO2 Sensor', tooltip='Norm Microlabs CO2 Sensor').add_to(map)
map
# Haversine formula: distance between the two sensor locations
import math

def haversine(lat1, lon1, lat2, lon2):
    R = 6371.0  # Radius of the Earth in kilometres
    # Convert latitude and longitude from degrees to radians
    lat1_rad, lon1_rad = math.radians(lat1), math.radians(lon1)
    lat2_rad, lon2_rad = math.radians(lat2), math.radians(lon2)
    dlat = lat2_rad - lat1_rad
    dlon = lon2_rad - lon1_rad
    # Haversine formula
    a = math.sin(dlat / 2)**2 + math.cos(lat1_rad) * math.cos(lat2_rad) * math.sin(dlon / 2)**2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    distance = R * c  # Distance in kilometres
    return distance
distance = haversine(coords_microclimate[0], coords_microclimate[1],
coords_norm_microlabs[0], coords_norm_microlabs[1])
print(f'Distance between Sensor locations: {distance}')
Distance between Sensor locations: 1.4307148758774588
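For matching many soil probes to their nearest CO2 sensor (the project's within-2 km correlation idea), the pairwise haversine can be vectorised with NumPy broadcasting. A sketch with illustrative coordinates (not the real site list):

```python
import numpy as np

def haversine_np(lat1, lon1, lat2, lon2):
    """Vectorised haversine distance in kilometres."""
    R = 6371.0
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * R * np.arcsin(np.sqrt(a))

# Illustrative coordinates: two soil probes, two CO2 sensors
soil = np.array([[-37.80170, 144.97138], [-37.82921, 144.97348]])
co2 = np.array([[-37.8133183, 144.9668853], [-37.8005750, 144.9646350]])

# Distance matrix: rows = soil probes, columns = CO2 sensors
d = haversine_np(soil[:, 0, None], soil[:, 1, None],
                 co2[None, :, 0], co2[None, :, 1])
nearest = d.argmin(axis=1)
print(nearest, d.min(axis=1).round(2))
```

`2 * R * arcsin(sqrt(a))` is algebraically equivalent to the `atan2` form used above; broadcasting the (n, 1) probe coordinates against the (1, m) sensor coordinates yields the full n-by-m distance matrix in one call.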
microclimate_sensor (Review)¶
- TPH.TEMP: -40.00 to 85.00, Ambient Air Temperature (°C)
- TPH.RH: 0.00 to 100.00, Relative Humidity (%)
- TPH.PRESSURE: 10.0 to 2000.0, Barometric Pressure (hPa)
- PM2.5: 0.0 to 999.9, mass density of particles smaller than 2.5 micrometres in diameter (μg/m³)
- PM10: 0.0 to 1999.9, mass density of particles smaller than 10 micrometres in diameter (μg/m³)
- WS: 0.00 to 160.00, Average Wind Speed (km/h)
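The documented min/max ranges can double as a sanity check on incoming readings. A minimal sketch, using a made-up sample frame (the `valid_ranges` bounds are the ones listed above):

```python
import pandas as pd

# Documented valid ranges per sensor type
valid_ranges = {
    'TPH.TEMP': (-40.0, 85.0),
    'TPH.RH': (0.0, 100.0),
    'TPH.PRESSURE': (10.0, 2000.0),
    'PM2.5': (0.0, 999.9),
    'PM10': (0.0, 1999.9),
    'WS': (0.0, 160.0),
}

# Illustrative readings; 120 % relative humidity is physically impossible
readings = pd.DataFrame({
    'type': ['TPH.TEMP', 'TPH.RH', 'WS'],
    'value': [24.5, 120.0, 12.0],
})

# Flag each reading as in or out of its documented range
bounds = readings['type'].map(valid_ranges)
readings['in_range'] = [lo <= v <= hi for v, (lo, hi) in zip(readings['value'], bounds)]
print(readings)
```

Out-of-range rows could then be dropped or imputed during the cleaning step rather than flowing into the dashboard.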
# Create date column
microclimate_sensor_cleaned['time'] = pd.to_datetime(microclimate_sensor_cleaned['start_reading'])
microclimate_sensor_cleaned['date'] = microclimate_sensor_cleaned['time'].dt.date
# Count null values 'id' column
null_ids_count = microclimate_sensor_cleaned['id'].isnull().sum()
print("Number of null values in the 'id' column:", null_ids_count)
null_id_records = microclimate_sensor_cleaned[microclimate_sensor_cleaned['id'].isnull()]
print(null_id_records)
# Drop records null
microclimate_sensor_cleaned = microclimate_sensor_cleaned.dropna(subset=['id'])
null_ids_count_after = microclimate_sensor_cleaned['id'].isnull().sum()
print("Number of null values in the 'id' column after cleaning:", null_ids_count_after)
Number of null values in the 'id' column: 0
Empty DataFrame
Columns: [site_id, gatewayhub_id_x, site_status_x, start_reading, latitude, longitude, location, local_time, id, sensor_id, value, type, units, gatewayhub_id_y, site_status_y, time, date]
Index: []
Number of null values in the 'id' column after cleaning: 0
unique_site_status_x = microclimate_sensor_cleaned['site_status_x'].unique()
unique_id = microclimate_sensor_cleaned['id'].unique()
unique_type = microclimate_sensor_cleaned['type'].unique()
unique_units = microclimate_sensor_cleaned['units'].unique()
print("Unique site_status_x:", unique_site_status_x)
print("Unique id:", unique_id)
print("Unique type:", unique_type)
print("Unique units:", unique_units)
unique_id
Unique site_status_x: ['C']
Unique id: [8110553. 8110558. 8110562. 8110554. 8110523. 8110532. 8110556. 8110565.
 8110522. 8110549. 8110555. 8110560. 8110564. 8110524. 8110525. 8110534.
 8110548. 8110531. 8110559. 8110563. 8110550. 8110533. 8110551. 8110561.
 8110489. 8110490. 8110500. 8110491. 8110492. 8110499. 8110501. 8110502.
 8110606. 8110576. 8110604. 8110605. 8110569. 8110566. 8110607. 8110567.
 8110594. 8110577. 8110600. 8110609. 8110578. 8110592. 8110597. 8110602.
 8110575. 8110593. 8110598. 8110603. 8110599. 8110608. 8110568. 8110595.]
Unique type: ['TPH.RH' 'TPH.PRESSURE' 'WS' 'PM2.5' 'PM10' 'TPH.TEMP']
Unique units: ['%' 'hPa' 'km/h' 'ug\\/m^3' 'C' 'ug/m^3']
array([8110553., 8110558., 8110562., 8110554., 8110523., 8110532.,
8110556., 8110565., 8110522., 8110549., 8110555., 8110560.,
8110564., 8110524., 8110525., 8110534., 8110548., 8110531.,
8110559., 8110563., 8110550., 8110533., 8110551., 8110561.,
8110489., 8110490., 8110500., 8110491., 8110492., 8110499.,
8110501., 8110502., 8110606., 8110576., 8110604., 8110605.,
8110569., 8110566., 8110607., 8110567., 8110594., 8110577.,
8110600., 8110609., 8110578., 8110592., 8110597., 8110602.,
8110575., 8110593., 8110598., 8110603., 8110599., 8110608.,
8110568., 8110595.])
microclimate_sensor_cleaned.head()
| site_id | gatewayhub_id_x | site_status_x | start_reading | latitude | longitude | location | local_time | id | sensor_id | value | type | units | gatewayhub_id_y | site_status_y | time | date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1011 | arc1055 | C | 2021-07-07 | -37.822222 | 144.952222 | -37.822222, 144.952222 | 2023-11-29 22:15:02+00:00 | 8110553.0 | 5b | 74.38 | TPH.RH | % | arc1055 | C | 2021-07-07 | 2021-07-07 |
| 1 | 1011 | arc1055 | C | 2021-07-07 | -37.822222 | 144.952222 | -37.822222, 144.952222 | 2023-11-29 22:15:02+00:00 | 8110558.0 | 5c | 1010.80 | TPH.PRESSURE | hPa | arc1055 | C | 2021-07-07 | 2021-07-07 |
| 2 | 1011 | arc1055 | C | 2021-07-07 | -37.822222 | 144.952222 | -37.822222, 144.952222 | 2023-11-29 22:15:03+00:00 | 8110562.0 | 6 | 0.00 | WS | km/h | arc1055 | C | 2021-07-07 | 2021-07-07 |
| 3 | 1011 | arc1055 | C | 2021-07-07 | -37.822222 | 144.952222 | -37.822222, 144.952222 | 2023-11-29 22:30:02+00:00 | 8110554.0 | 5b | 74.19 | TPH.RH | % | arc1055 | C | 2021-07-07 | 2021-07-07 |
| 4 | 1011 | arc1055 | C | 2021-07-07 | -37.822222 | 144.952222 | -37.822222, 144.952222 | 2023-11-29 22:30:11+00:00 | 8110523.0 | 0a | 5.90 | PM2.5 | ug\/m^3 | arc1055 | C | 2021-07-07 | 2021-07-07 |
# Print data from microclimate sensor
unique_types = microclimate_sensor_cleaned['type'].unique()
num_rows = (len(unique_types) + 1) // 2 # +1 to handle odd number of types
fig, axs = plt.subplots(num_rows, 2, figsize=(20, 6 * num_rows))
for i, sensor_type in enumerate(unique_types):
    type_data = microclimate_sensor_cleaned[microclimate_sensor_cleaned['type'] == sensor_type]
    row = i // 2
    col = i % 2
    ax = axs[row, col] if num_rows > 1 else axs[col]
    ax.plot(type_data['date'], type_data['value'], marker='o', linestyle='-')
    ax.set_title(f'{sensor_type} Values Over Time')
    ax.set_xlabel('Date')
    ax.set_ylabel(f'Value ({type_data["units"].iloc[0]})')
    ax.grid(True)
plt.tight_layout()
plt.show()
# Exclude 'TPH.RH', 'TPH.PRESSURE' due to different metrics
color_l = [color_d[0], color_d[1], color_d[2], color_d[3], 'darkolivegreen', 'mediumseagreen', 'lightgreen', 'green', 'seagreen', 'forestgreen']
unique_types = microclimate_sensor_cleaned.loc[
    ~microclimate_sensor_cleaned['type'].isin(['TPH.RH', 'TPH.PRESSURE']), 'type'
].unique()
plt.figure(figsize=(12, 8))
for i, sensor_type in enumerate(unique_types):
    type_data = microclimate_sensor_cleaned[microclimate_sensor_cleaned['type'] == sensor_type]
    plt.plot(type_data['date'], type_data['value'], marker='o', linestyle='-', color=color_l[i % len(color_l)], label=sensor_type)
plt.title('Sensor Values Over Time (excluding TPH.RH)')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend(title='Type')
plt.grid(True)
plt.show()
import pandas as pd
import ipywidgets as widgets
import re
import matplotlib.pyplot as plt
from IPython.display import display
from matplotlib.ticker import MaxNLocator
# Dropdown options
def extract_info_from_columns(df):
    types = set()
    depths = set()
    measurements = set()
    col_pattern = r"(Soil \w+) (\d+cm) #\d+"
    for probe_measure in df['probe_measure'].unique():
        match = re.match(col_pattern, probe_measure)
        if match:
            measurement, depth = match.groups()
            measurements.add(measurement)
            depths.add(depth)
    types.add('General')
    return sorted(types), sorted(depths), sorted(measurements)
types, depths, measurements = extract_info_from_columns(soil_co2_sensor_data)
# Filtering function
def filter_data(df, soil_type, depth, measurement):
    regex = rf"{measurement} {depth} #\d+"  # raw string so \d is a regex digit class
    filtered_df = df[df['probe_measure'].str.contains(regex, regex=True)]
    return filtered_df
# Widgets
type_selector = widgets.Dropdown(options=types, description='Soil Type:')
depth_selector = widgets.Dropdown(options=depths, description='Depth:')
measurement_selector = widgets.Dropdown(options=measurements, description='Measurement:')
overlay_selector = widgets.Dropdown(options=['None', 'Temperature', 'Humidity', 'Light', 'Motion', 'VDD', 'CO2'], description='Overlay:')
output = widgets.Output()
# Plotting function
def plot_data(df, overlay):
    plt.figure(figsize=(14, 6))
    # Soil data
    if 'soil_value' in df.columns:
        plt.plot(df['datetime'], df['soil_value'], label='Soil Value', color=color_d[1], alpha=0.4)
        plt.axhline(df['soil_value'].mean(), color='blue', linestyle='--', label='Avg Soil Value')
    # Overlay data
    if overlay != 'None' and overlay.lower() in df.columns:
        plt.plot(df['datetime'], df[overlay.lower()], label=f'{overlay} Level', color=color_d[3], alpha=0.7)
        plt.axhline(df[overlay.lower()].mean(), color='red', linestyle='--', label=f'Avg {overlay} Level')
    plt.xlabel('Time')
    plt.ylabel('Value')
    plt.title('Soil and Environmental Data Over Time')
    plt.legend()
    plt.gca().xaxis.set_major_locator(MaxNLocator(integer=True, nbins=10))  # Limit x-axis tick count
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
# Update function
def update_output(change):
    with output:
        output.clear_output()
        filtered_df = filter_data(soil_co2_sensor_data, type_selector.value, depth_selector.value, measurement_selector.value)
        plot_data(filtered_df, overlay_selector.value)
type_selector.observe(update_output, names='value')
depth_selector.observe(update_output, names='value')
measurement_selector.observe(update_output, names='value')
overlay_selector.observe(update_output, names='value')
# Display widgets
display(type_selector, depth_selector, measurement_selector, overlay_selector, output)
Dropdown(description='Soil Type:', options=('General',), value='General')
Dropdown(description='Depth:', options=('10cm', '20cm', '30cm', '40cm', '50cm', '60cm', '70cm', '80cm'), value…
Dropdown(description='Measurement:', options=('Soil Moisture', 'Soil Salinity', 'Soil Temperature'), value='So…
Dropdown(description='Overlay:', options=('None', 'Temperature', 'Humidity', 'Light', 'Motion', 'VDD', 'CO2'),…
Output()
3. Predictive Analytics Model¶
- Data processing, normalisation, visualisation, and model preparation (windowing).
Develop a model to predict environmental conditions and alert for anomalies or hazardous conditions based on historical and current sensor data.¶
- Time series forecasting using TensorFlow
- Convolutional & Recurrent Neural Networks (CNNs and RNNs)
Forecast for a single time step:¶
- A single feature or all features.
Forecast multiple steps:¶
- Predictions all at once (Single-shot)
- Autoregressive: each prediction is fed back into the model as input for the next step.
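The two multi-step strategies can be sketched conceptually before bringing in TensorFlow. This uses a trivial linear-trend predictor as a stand-in for the trained model (an assumption for illustration only):

```python
import numpy as np

history = np.array([0.2, 0.4, 0.6, 0.8])
STEPS = 3  # forecast horizon

def trend_step(window):
    # Naive one-step forecast: extend the last observed trend
    return window[-1] + (window[-1] - window[-2])

# Single-shot: produce every horizon step from the history in one call
trend = history[-1] - history[-2]
single_shot = history[-1] + trend * np.arange(1, STEPS + 1)

# Autoregressive: predict one step, append it to the window, repeat
window = history.copy()
autoregressive = []
for _ in range(STEPS):
    pred = trend_step(window)
    autoregressive.append(pred)
    window = np.append(window, pred)

print(single_shot, autoregressive)
```

With this toy predictor the two strategies agree; with a learned model they generally differ, since autoregressive forecasts compound their own errors while single-shot models spread error across the whole horizon.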
import IPython
import IPython.display
import matplotlib as mpl
import matplotlib.pyplot as plt
import os
import datetime
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
mpl.rcParams['figure.figsize'] = (8, 6)
mpl.rcParams['axes.grid'] = False
microclimate_sensor.head()
| site_id | gatewayhub_id_x | site_status_x | start_reading | latitude | longitude | location | local_time | id | sensor_id | value | type | units | gatewayhub_id_y | site_status_y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1011 | arc1055 | C | 2021-07-07 | -37.822222 | 144.952222 | -37.822222, 144.952222 | 2023-11-29 22:15:02+00:00 | 8110553.0 | 5b | 74.38 | TPH.RH | % | arc1055 | C |
| 1 | 1011 | arc1055 | C | 2021-07-07 | -37.822222 | 144.952222 | -37.822222, 144.952222 | 2023-11-29 22:15:02+00:00 | 8110558.0 | 5c | 1010.80 | TPH.PRESSURE | hPa | arc1055 | C |
| 2 | 1011 | arc1055 | C | 2021-07-07 | -37.822222 | 144.952222 | -37.822222, 144.952222 | 2023-11-29 22:15:03+00:00 | 8110562.0 | 6 | 0.00 | WS | km/h | arc1055 | C |
| 3 | 1011 | arc1055 | C | 2021-07-07 | -37.822222 | 144.952222 | -37.822222, 144.952222 | 2023-11-29 22:30:02+00:00 | 8110554.0 | 5b | 74.19 | TPH.RH | % | arc1055 | C |
| 4 | 1011 | arc1055 | C | 2021-07-07 | -37.822222 | 144.952222 | -37.822222, 144.952222 | 2023-11-29 22:30:11+00:00 | 8110523.0 | 0a | 5.90 | PM2.5 | ug\/m^3 | arc1055 | C |
# describe based on time
microclimate = microclimate_sensor.sort_values('local_time')
print(microclimate['local_time'].describe())
count                            56
unique                           14
top       2023-11-29 22:30:02+00:00
freq                              6
Name: local_time, dtype: object
soil_co2_sensor_data.head()
| datetime | site_id | id | probe_measure | soil_value | unit | latitude_x | longitude_x | location | time_x | date_x | dev_id | time_y | temperature | humidity | light | motion | visit | vdd | lat_long | co2 | latitude_y | longitude_y | date_y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-12-31 13:00:00 | 64979 | 15522826 | Soil Moisture 50cm #0 | 32.63 | %VWC | -37.80170 | 144.97138 | -37.801699999983285, 144.9713799999935 | 13:00:00 | 2022-12-31 | ers-co2-9da0 | 13:00:46 | 24.5 | 61 | 0.0 | 0 | 0 | 3656 | -37.8133183, 144.9668853 | 416.0 | -37.813318 | 144.966885 | 2022-12-31 |
| 1 | 2022-12-31 13:00:00 | 66195 | 15522987 | Soil Temperature 50cm #0 | 21.36 | ºC | -37.82073 | 144.97254 | -37.82072999998553, 144.97253999999268 | 13:00:00 | 2022-12-31 | ers-co2-9da0 | 13:00:46 | 24.5 | 61 | 0.0 | 0 | 0 | 3656 | -37.8133183, 144.9668853 | 416.0 | -37.813318 | 144.966885 | 2022-12-31 |
| 2 | 2022-12-31 13:00:00 | 64976 | 15515162 | Soil Salinity 20cm #0 | 0.05 | µS/cm | -37.78292 | 144.95566 | -37.78291999998904, 144.95566000000628 | 13:00:00 | 2022-12-31 | ers-co2-9da0 | 13:00:46 | 24.5 | 61 | 0.0 | 0 | 0 | 3656 | -37.8133183, 144.9668853 | 416.0 | -37.813318 | 144.966885 | 2022-12-31 |
| 3 | 2022-12-31 13:00:00 | 64986 | 15521623 | Soil Moisture 50cm #0 | 51.39 | %VWC | -37.82921 | 144.97348 | -37.8292099999859, 144.97347999998564 | 13:00:00 | 2022-12-31 | ers-co2-9da0 | 13:00:46 | 24.5 | 61 | 0.0 | 0 | 0 | 3656 | -37.8133183, 144.9668853 | 416.0 | -37.813318 | 144.966885 | 2022-12-31 |
| 4 | 2022-12-31 13:00:00 | 64979 | 15514301 | Soil Temperature 30cm #0 | 20.63 | ºC | -37.80170 | 144.97138 | -37.801699999983285, 144.9713799999935 | 13:00:00 | 2022-12-31 | ers-co2-9da0 | 13:00:46 | 24.5 | 61 | 0.0 | 0 | 0 | 3656 | -37.8133183, 144.9668853 | 416.0 | -37.813318 | 144.966885 | 2022-12-31 |
soil_co2_sensor_data = soil_co2_sensor_data.sort_values('datetime')
print(soil_co2_sensor_data['datetime'].describe())
count                          1184290
mean     2023-08-26 21:39:28.407498240
min                2022-12-31 13:00:00
25%                2023-02-25 23:00:00
50%                2023-10-29 11:00:00
75%                2024-01-11 11:00:00
max                2024-04-01 21:00:00
Name: datetime, dtype: object
# Check values in columns before drop
# print("Unique Light:", soil_co2_sensor_data['light'].unique())
# print("Unique motion:", soil_co2_sensor_data['motion'].unique())
# print("Unique visit:", soil_co2_sensor_data['visit'].unique())
# print("Unique time_x:", soil_co2_sensor_data['time_x'].unique())
# print("Unique date_x:", soil_co2_sensor_data['date_x'].unique())
# copy first before drop
df = soil_co2_sensor_data.copy()
df = df.rename(columns={'time_x': 'time', 'date_x': 'date'})
df.drop(['latitude_x', 'longitude_x', 'location', 'latitude_y', 'longitude_y', 'date_y', 'time_y', 'visit', 'motion'], axis=1, inplace=True)
print(df.columns)
Index(['datetime', 'site_id', 'id', 'probe_measure', 'soil_value', 'unit',
'time', 'date', 'dev_id', 'temperature', 'humidity', 'light', 'vdd',
'lat_long', 'co2'],
dtype='object')
nan_counts = df.isna().sum()
print(nan_counts)
datetime         0
site_id          0
id               0
probe_measure    0
soil_value       0
unit             0
time             0
date             0
dev_id           0
temperature      0
humidity         0
light            0
vdd              0
lat_long         0
co2              0
dtype: int64
# View data
fig, ax1 = plt.subplots(figsize=(10, 5))
color = 'tab:red'
ax1.set_xlabel('Time')
ax1.set_ylabel('Temperature', color=color)
ax1.plot(df['datetime'], df['temperature'], color= color_d[3], alpha=0.5)
ax1.tick_params(axis='y', labelcolor=color)
ax2 = ax1.twinx() # 2nd axis
color = 'tab:blue'
ax2.set_ylabel('Soil Value', color=color)
ax2.plot(df['datetime'], df['soil_value'], color= color_d[2], alpha=0.5)
ax2.tick_params(axis='y', labelcolor=color)
fig.tight_layout()
plt.title('Temperature and Soil Value Over Time')
plt.show()
Save (A) or Load (B) the Dataset df¶
# A. Save dataset
# df.to_csv('/content/drive/My Drive/sit378_eco_metrix/df.csv', index=False)
# B. Load dataset
# df = pd.read_csv('/content/drive/My Drive/sit378_eco_metrix/df.csv')
# For hourly predictions, sub-sample the data from 10-minute intervals to one-hour intervals
data = df.copy()
data = data[5::6]
date_time = pd.to_datetime(data.pop('datetime'))  # column is already datetime64, so no format string needed
data.head()
| site_id | id | probe_measure | soil_value | unit | time | date | dev_id | temperature | humidity | light | vdd | lat_long | co2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 578 | 64989 | 15514033 | Soil Salinity 80cm #0 | 0.13 | µS/cm | 13:00:00 | 2022-12-31 | ers-co2-9da0 | 24.5 | 61 | 0.0 | 3656 | -37.8133183, 144.9668853 | 416.0 |
| 584 | 66203 | 15517851 | Soil Temperature 10cm #0 | 22.18 | ºC | 13:00:00 | 2022-12-31 | ers-co2-9da0 | 24.5 | 61 | 0.0 | 3656 | -37.8133183, 144.9668853 | 416.0 |
| 590 | 64993 | 15512581 | Soil Moisture 20cm #0 (Clayey Loam) | 25.96 | %VWC | 13:00:00 | 2022-12-31 | ers-co2-9da0 | 24.5 | 61 | 0.0 | 3656 | -37.8133183, 144.9668853 | 416.0 |
| 596 | 64993 | 15512509 | Soil Temperature 40cm #0 | 21.04 | ºC | 13:00:00 | 2022-12-31 | ers-co2-9da0 | 24.5 | 61 | 0.0 | 3656 | -37.8133183, 144.9668853 | 416.0 |
| 548 | 64998 | 15515032 | Soil Moisture 70cm #0 (Clay) | 45.17 | %VWC | 13:00:00 | 2022-12-31 | ers-co2-9da0 | 24.5 | 61 | 0.0 | 3656 | -37.8133183, 144.9668853 | 416.0 |
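Note that positional sub-sampling with `data[5::6]` assumes the readings are perfectly regular; a time-based resample is a more robust alternative when intervals may be missing or uneven. A sketch with a small synthetic frame (not the project data):

```python
import numpy as np
import pandas as pd

# Twelve 10-minute readings spanning two hours
idx = pd.date_range('2022-12-31 13:00', periods=12, freq='10min')
demo = pd.DataFrame({'temperature': np.linspace(20.0, 22.2, 12)}, index=idx)

# Aggregate to hourly means, anchored to the clock hour
hourly = demo.resample('60min').mean()
print(hourly)
```

Unlike `[5::6]`, `resample` keeps working (producing NaNs or partial-hour means) when sensors drop readings, which these feeds commonly do.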
# find probe measures
print("Unique probe_measures:", data['probe_measure'].unique())
Unique probe_measures: ['Soil Salinity 80cm #0' 'Soil Temperature 10cm #0' 'Soil Moisture 20cm #0 (Clayey Loam)' 'Soil Temperature 40cm #0' 'Soil Moisture 70cm #0 (Clay)' 'Soil Moisture 30cm #0 (Clay)' 'Soil Moisture 10cm #0 (Sandy Loam)' 'Soil Moisture 40cm #0' 'Soil Salinity 40 cm #0' 'Soil Moisture 50cm #0' 'Soil Moisture 60cm #0 (Clay)' 'Soil Salinity 50cm #0' 'Soil Salinity 40cm #0' 'Soil Salinity 60cm #0' 'Soil Salinity 70cm #0' 'Soil Temperature 70cm #0' 'Soil Moisture 10cm #0 (Sand)' 'Soil Moisture 20cm #0 (Sandy Loam)' 'Soil Moisture 30cm #0' 'Soil Temperature 60cm #0' 'Soil Salinity 10cm #0' 'Soil Salinity 30cm #0' 'Soil Moisture 60cm #0' 'Soil Temperature 50cm #0' '#0 Soil Moisture 40cm' 'Soil Temperature 80cm #0' 'Soil Moisture 40cm #0 (Clay)' 'Soil Moisture 50cm #0 (Clay)' 'Soil Moisture 80cm #0' 'Soil Moisture 10cm #0' '#0 Soil Moisture 50cm' 'Soil Moisture 70cm #0' 'Soil Salinity 20cm #0' '#0 Soil Moisture 70cm' 'Soil Moisture 20cm #0' 'Soil Temperature 30cm #0' 'Soil Moisture 40cm #0 (Sand)' 'Soil Moisture 80cm #0 (Clay)' 'Soil Temperature 20cm #0' 'Soil Moisture 60cm #0 (Sandy Loam)' '#0 Soil Moisture 80cm' 'Soil Moisture 30cm #0 (Clayey Loam)' '#0 Soil Moisture 10cm Clay Loam' 'Soil Moisture 80cm #0 (Sandy Loam)' '#0 Soil Moisture 60cm' 'Soil Moisture 50cm #0 (Loam)' 'Soil Moisture 30cm #0 (Sand)' '#0 Soil Moisture 10cm' 'Soil Moisture 10cm #0 (Clayey Loam)' '#0 Soil Moisture 30cm' 'Soil Moisture 30cm #0 (Sandy Loam)' 'Soil Moisture 50cm #0 (Sandy Loam)' 'Soil Moisture 80cm #0 (Loam)' '#0 Soil Moisture 20cm' 'Soil Moisture 40cm #0 (Clayey Loam)' 'Soil Moisture 40cm #0 (Sandy Loam)' 'Soil Moisture 40cm #0 (Loam)' 'Soil Moisture 70cm #0 (Sandy Loam)' 'Soil Moisture 70cm #0 (Loam)' 'Soil Moisture 20cm #0 (Sand)' 'Soil Moisture 60cm #0 (Loam)']
# Unique key
data['unique_key'] = data['site_id'].astype(str) + '_' + data['date'].astype(str) + '_' + data['time'].astype(str)
# Pivot
pivot_data = data.pivot_table(index='unique_key', columns='probe_measure', values='soil_value', aggfunc='first').reset_index()
# Rename
pivot_data.columns = ['_'.join(col).strip() if isinstance(col, tuple) else col for col in pivot_data.columns]
# Print the count of non-NaN values
non_nan_counts = pivot_data.count()
print(non_nan_counts)
unique_key 45392
#0 Soil Moisture 10cm 2002
#0 Soil Moisture 10cm Clay Loam 202
#0 Soil Moisture 20cm 2215
#0 Soil Moisture 30cm 2289
...
Soil Temperature 40cm #0 7504
Soil Temperature 50cm #0 5545
Soil Temperature 60cm #0 5475
Soil Temperature 70cm #0 5505
Soil Temperature 80cm #0 5552
Length: 62, dtype: int64
probe_measure_counts = data['probe_measure'].value_counts()
print("Count of records for each probe_measure:")
print(probe_measure_counts)
Count of records for each probe_measure:
probe_measure
Soil Moisture 30cm #0 7709
Soil Temperature 30cm #0 7706
Soil Temperature 10cm #0 7582
Soil Moisture 20cm #0 7577
Soil Moisture 10cm #0 7560
...
#0 Soil Moisture 10cm Clay Loam 202
Soil Moisture 70cm #0 (Sandy Loam) 126
Soil Moisture 80cm #0 (Sandy Loam) 126
Soil Moisture 50cm #0 (Sandy Loam) 125
Soil Moisture 60cm #0 (Sandy Loam) 123
Name: count, Length: 61, dtype: int64
Descriptive statistics¶
- mean, standard deviation, min, max, etc
# Filter soil (Soil Moisture 30cm #0 7709)
# Convert, concatenate to datetime
data['datetime'] = pd.to_datetime(data['date'].astype(str) + ' ' + data['time'].astype(str))
filtered_data = data[data['probe_measure'] == 'Soil Moisture 30cm #0']
plot_cols = ['temperature', 'humidity', 'light', 'vdd', 'co2', 'soil_value']
filtered_data = filtered_data[plot_cols + ['datetime']].dropna(subset=plot_cols)
filtered_data.set_index('datetime', inplace=True) # set index
fig, axs = plt.subplots(len(plot_cols), 1, figsize=(10, 15))
for i, col in enumerate(plot_cols):
filtered_data[col].plot(ax=axs[i], title=col, color= color_d[2])
axs[i].set_ylabel(col)
plt.tight_layout()
plt.show()
# Plot first 480 records
if len(filtered_data) > 480:
filtered_data[:480].plot(subplots=True, figsize=(10, 12), color=color_d[1])
plt.show()
# View stats
# Filter
filtered_data = data[data['probe_measure'] == 'Soil Moisture 30cm #0']
columns = ['co2', 'vdd', 'humidity', 'temperature', 'soil_value']
stats = filtered_data[columns].describe().transpose()
stats
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| co2 | 7709.0 | 434.963938 | 83.544455 | 381.00 | 409.00 | 417.0 | 433.00 | 2209.00 |
| vdd | 7709.0 | 3651.589700 | 3.727366 | 3640.00 | 3648.00 | 3654.0 | 3654.00 | 3659.00 |
| humidity | 7709.0 | 55.932546 | 7.858678 | 34.00 | 51.00 | 55.0 | 60.00 | 86.00 |
| temperature | 7709.0 | 21.269101 | 2.214275 | 13.30 | 20.00 | 21.4 | 22.60 | 29.90 |
| soil_value | 7709.0 | 37.212963 | 13.640875 | 12.44 | 28.37 | 34.6 | 44.34 | 82.25 |
# Convert datetime to seconds since epoch
timestamp_s = data['datetime'].map(pd.Timestamp.timestamp)
Converts datetime to timestamps in seconds¶
- Computes the sine and cosine of the time of day and time of year
- Captures the cyclic nature of days/years in time series data
- Aim is to review seasonal trends
# Sine/cosine features to capture daily and yearly seasonality
day = 24*60*60
year = (365.2425)*day
data['Day sin'] = np.sin(timestamp_s * (2 * np.pi / day))
data['Day cos'] = np.cos(timestamp_s * (2 * np.pi / day))
data['Year sin'] = np.sin(timestamp_s * (2 * np.pi / year))
data['Year cos'] = np.cos(timestamp_s * (2 * np.pi / year))
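As a quick sanity check (a minimal NumPy sketch, not part of the notebook), each sine/cosine pair should lie on the unit circle, and midnight and noon should map to opposite points on the daily cycle:

```python
import numpy as np

day = 24 * 60 * 60  # seconds per day

# Hypothetical timestamps on the same day: midnight, 6am, noon
ts = np.array([0, 6 * 3600, 12 * 3600], dtype=float)
day_sin = np.sin(ts * (2 * np.pi / day))
day_cos = np.cos(ts * (2 * np.pi / day))

# Each (sin, cos) pair lies on the unit circle
assert np.allclose(day_sin**2 + day_cos**2, 1.0)
# Midnight -> (0, 1); noon -> (0, -1): opposite points on the cycle
assert np.allclose([day_sin[0], day_cos[0]], [0.0, 1.0])
assert np.allclose([day_sin[2], day_cos[2]], [0.0, -1.0])
```

This is why the encoding is preferred over a raw hour-of-day integer: 23:59 and 00:01 end up close together instead of far apart.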
# Temperature over time
# Temperature was found to be a predictor using the Granger Causality Test
# Units: 'temperature', 'humidity', 'vdd', 'co2', 'soil_value'
fft = tf.signal.rfft(data['temperature'])
f_per_dataset = np.arange(0, len(fft))
n_samples_h = len(data['temperature'])
hours_per_year = 24*365.2425  # match the year length used above
years_per_dataset = n_samples_h/(hours_per_year)
f_per_year = f_per_dataset/years_per_dataset
plt.step(f_per_year, np.abs(fft), color_d[2])
plt.xscale('log')
plt.ylim(0, 100000)
plt.xlim([0.1, max(plt.xlim())])
plt.xticks([1, 365.2425], labels=['1/Year', '1/day'])
_ = plt.xlabel('Frequency (log scale) temperature')
Split data for training¶
70% train, 20% validation, 10% test.
Given a list of consecutive inputs, split_window converts them into a window of inputs and a window of labels.
# Split data
if 'datetime' not in data.columns:
data['datetime'] = pd.to_datetime(data['date'].astype(str) + ' ' + data['time'].astype(str))
# Calculate indices data
columns = ['datetime', 'co2', 'vdd', 'humidity', 'temperature', 'soil_value']
data_filtered = data[columns]
data_filtered.set_index('datetime', inplace=True)
n = len(data_filtered) # calculate
train_df = data_filtered[0:int(n * 0.7)]
val_df = data_filtered[int(n * 0.7):int(n * 0.9)]
test_df = data_filtered[int(n * 0.9):]
# Features, not 'datetime' index
num_features = len(columns) - 1
column_indices = {name: i for i, name in enumerate(data_filtered.columns)}
print("Number of features:", num_features)
print("Training Set Length:", len(train_df))
print("Validation Set Length:", len(val_df))
print("Testing Set Length:", len(test_df))
Number of features: 5 Training Set Length: 138166 Validation Set Length: 39476 Testing Set Length: 19739
# Normalise data
train_mean = train_df.select_dtypes(include=['float64', 'int64']).mean()
train_std = train_df.select_dtypes(include=['float64', 'int64']).std()
# Normalisation to numeric columns
numeric_cols = train_df.select_dtypes(include=['float64', 'int64']).columns
train_df[numeric_cols] = (train_df[numeric_cols] - train_mean) / train_std
val_df[numeric_cols] = (val_df[numeric_cols] - train_mean) / train_std
test_df[numeric_cols] = (test_df[numeric_cols] - train_mean) / train_std
Violin plots for each feature in the normalised training dataset¶
# View distribution
df_std = pd.DataFrame(data=train_df, columns=numeric_cols)
df_std = df_std.melt(var_name='Column', value_name='Normalized')
plt.figure(figsize=(12, 6))
ax = sns.violinplot(x='Column', y='Normalized', data=df_std, color= color_d[2])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.show()
import pandas as pd
df_std = pd.DataFrame(data=train_df, columns=numeric_cols)
print("Basic Statistics for each numeric column:")
print(df_std.describe())
Basic Statistics for each numeric column:
co2 vdd humidity temperature soil_value
count 1.381660e+05 1.381660e+05 1.381660e+05 1.381660e+05 1.381660e+05
mean -6.582623e-17 -5.205538e-14 8.228279e-18 1.456405e-16 -1.411150e-16
std 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
min -5.293187e-01 -2.653729e+00 -3.067528e+00 -3.257021e+00 -1.386784e+00
25% -2.550561e-01 -7.480756e-01 -6.878534e-01 -6.216164e-01 -5.564468e-01
50% -1.888548e-01 6.811648e-01 -1.868692e-01 5.860924e-03 -1.701336e-01
75% -3.753753e-02 6.811648e-01 5.646071e-01 6.751700e-01 6.519596e-01
max 1.675868e+01 1.872198e+00 3.821004e+00 4.272707e+00 4.794815e+00
Review Stats¶
- Each variable has been standardised (normalised).
- CO2 has high outliers compared to the mean.
- VDD (voltage) has potential outliers on the low end (low power); the median is above zero, indicating a left-skewed distribution (long lower tail).
- Humidity reflects possible seasonal variation (the median is balanced).
- Temperature shows significant variation consistent with seasonal trends, but the median suggests a roughly symmetric distribution.
- Soil value has some extreme outliers, potentially from seasonal/conditional impacts.
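The skew observations above can be checked numerically with pandas `Series.skew()`; a minimal sketch using synthetic stand-ins rather than the actual sensor readings:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in for a left-skewed reading like vdd:
# mostly high values with a long lower tail
left_skewed = pd.Series(5.0 - rng.exponential(scale=1.0, size=10_000))
# Stand-in for a roughly symmetric reading like temperature
symmetric = pd.Series(rng.normal(loc=21.0, scale=2.2, size=10_000))

print(left_skewed.skew())  # clearly negative: long lower tail
print(symmetric.skew())    # near zero: roughly symmetric
```

A negative skew coefficient confirms a long lower tail (median above the mean), matching the vdd observation.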
Generate batches of temporal input-output pairs from the time series data.¶
- Time series analysis to predict future values based on past input windows.
- Parameters define how wide the input/label windows should be and how far ahead the model should predict.
- Stores training, validation, and test data.
- Calculates indices for selecting the input and label data in each split.
- Future aims: feature scaling and cyclical encoding.
class WindowGenerator():
def __init__(self, input_width, label_width, shift,
train_df=train_df, val_df=val_df, test_df=test_df,
label_columns=None):
self.train_df = train_df
self.val_df = val_df
self.test_df = test_df
# Labels
self.label_columns = label_columns
if label_columns is not None:
self.label_columns_indices = {name: i for i, name in
enumerate(label_columns)}
self.column_indices = {name: i for i, name in
enumerate(train_df.columns)}
# Parameters
self.input_width = input_width
self.label_width = label_width
self.shift = shift
self.total_window_size = input_width + shift
self.input_slice = slice(0, input_width)
self.input_indices = np.arange(self.total_window_size)[self.input_slice]
self.label_start = self.total_window_size - self.label_width
self.labels_slice = slice(self.label_start, None)
self.label_indices = np.arange(self.total_window_size)[self.labels_slice]
def __repr__(self):
return '\n'.join([
f'Total window size: {self.total_window_size}',
f'Input indices: {self.input_indices}',
f'Label indices: {self.label_indices}',
f'Label column name(s): {self.label_columns}'])
Visualise Window Data¶
- And predictions made by a model
- Shows how the model is performing relative to the actual data
window1 = WindowGenerator(input_width=24, label_width=1, shift=24,
label_columns=['temperature'])
window1
Total window size: 48 Input indices: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23] Label indices: [47] Label column name(s): ['temperature']
window2 = WindowGenerator(input_width=6, label_width=1, shift=1,
label_columns=['temperature'])
window2
Total window size: 7 Input indices: [0 1 2 3 4 5] Label indices: [6] Label column name(s): ['temperature']
print(train_df.columns)
Index(['co2', 'vdd', 'humidity', 'temperature', 'soil_value'], dtype='object')
Predictions based on a window of consecutive samples from the data.¶
- Data in TensorFlow is stored in arrays where the outermost index is across examples (the "batch" dimension).
- The middle indices are the "time" or "space" (width, height) dimension(s).
- The innermost indices are the features.
# Slicing doesn't preserve static shape information
# Set Manually for inspection later
def split_window(self, features):
inputs = features[:, self.input_slice, :]
labels = features[:, self.labels_slice, :]
if self.label_columns is not None:
labels = tf.stack(
[labels[:, :, self.column_indices[name]] for name in self.label_columns],
axis=-1)
inputs.set_shape([None, self.input_width, None])
labels.set_shape([None, self.label_width, None])
return inputs, labels
WindowGenerator.split_window = split_window
# Stack three slices, each the length of the total window
example_window = tf.stack([np.array(train_df[:window2.total_window_size]),
np.array(train_df[100:100+window2.total_window_size]),
np.array(train_df[200:200+window2.total_window_size])])
example_inputs, example_labels = window2.split_window(example_window)
print('All shapes are: (batch, time, features)')
print(f'Window shape: {example_window.shape}')
print(f'Inputs shape: {example_inputs.shape}')
print(f'Labels shape: {example_labels.shape}')
All shapes are: (batch, time, features) Window shape: (3, 7, 5) Inputs shape: (3, 6, 5) Labels shape: (3, 1, 1)
# Plot
window2.example = example_inputs, example_labels
def plot(self, model=None, plot_col='temperature', max_subplots=3):
inputs, labels = self.example
plt.figure(figsize=(12, 8))
plot_col_index = self.column_indices[plot_col]
max_n = min(max_subplots, len(inputs))
for n in range(max_n):
plt.subplot(max_n, 1, n+1)
plt.ylabel(f'{plot_col} [normed]')
plt.plot(self.input_indices, inputs[n, :, plot_col_index],
label='Inputs', marker='.', zorder=-10)
if self.label_columns:
label_col_index = self.label_columns_indices.get(plot_col, None)
else:
label_col_index = plot_col_index
if label_col_index is None:
continue
plt.scatter(self.label_indices, labels[n, :, label_col_index],
edgecolors='k', label='Labels', c='#2ca02c', s=64)
if model is not None:
predictions = model(inputs)
plt.scatter(self.label_indices, predictions[n, :, label_col_index],
marker='X', edgecolors='k', label='Predictions',
c='#ff7f0e', s=64)
if n == 0:
plt.legend()
plt.xlabel('Time [h]')
WindowGenerator.plot = plot
window2.plot()
# Plot other columns
# Please be aware - window2's configuration only has labels for the temperature column!
window2.plot(plot_col='co2')
Make a dataset (tf.data.Dataset) of input window and label window¶
- Uses pairs & tf.keras.utils.timeseries_dataset_from_array function
def make_dataset(self, data):
data = np.array(data, dtype=np.float32)
ds = tf.keras.utils.timeseries_dataset_from_array(
data=data,
targets=None,
sequence_length=self.total_window_size,
sequence_stride=1,
shuffle=True,
batch_size=32,)
ds = ds.map(self.split_window)
return ds
WindowGenerator.make_dataset = make_dataset
# Properties exposing the training, validation and test datasets
@property
def train(self):
return self.make_dataset(self.train_df)
@property
def val(self):
return self.make_dataset(self.val_df)
@property
def test(self):
return self.make_dataset(self.test_df)
@property
def example(self):
"""Get and cache an example batch of `inputs, labels` for plotting."""
result = getattr(self, '_example', None)
if result is None:
result = next(iter(self.train)) # use .train dataset
self._example = result # cache for next time
return result
WindowGenerator.train = train
WindowGenerator.val = val
WindowGenerator.test = test
WindowGenerator.example = example
# Each element is an inputs & label pair
window2.train.element_spec
(TensorSpec(shape=(None, 6, 5), dtype=tf.float32, name=None), TensorSpec(shape=(None, 1, 1), dtype=tf.float32, name=None))
for example_inputs, example_labels in window2.train.take(1):
print(f'Inputs shape (batch, time, features): {example_inputs.shape}')
print(f'Labels shape (batch, time, features): {example_labels.shape}')
Inputs shape (batch, time, features): (32, 6, 5) Labels shape (batch, time, features): (32, 1, 1)
one_step_window = WindowGenerator(
input_width=1, label_width=1, shift=1,
label_columns=['temperature'])
one_step_window
Total window size: 2 Input indices: [0] Label indices: [1] Label column name(s): ['temperature']
for example_inputs, example_labels in one_step_window.train.take(1):
print(f'Inputs shape (batch, time, features): {example_inputs.shape}')
print(f'Labels shape (batch, time, features): {example_labels.shape}')
Inputs shape (batch, time, features): (32, 1, 5) Labels shape (batch, time, features): (32, 1, 1)
Base Model¶
- Predicts the next value in a time series as the last available value in the input window.
- To compare against more complex models.
Performance baseline as a point for comparison¶
- Returns the input as-is, or extracts a single column from the input tensor
- The model is evaluated using mean squared error and mean absolute error on the validation dataset
Baseline, linear and dense models handle time steps differently¶
- Goal is to create a WindowGenerator that produces batches of three-hour inputs and one-hour labels
- A limitation is that the resulting model can only execute on input windows of exactly this shape
class Baseline(tf.keras.Model):
def __init__(self, label_index=None):
super().__init__()
self.label_index = label_index
def call(self, inputs):
if self.label_index is None:
return inputs
result = inputs[:, :, self.label_index]
return result[:, :, tf.newaxis]
baseline = Baseline(label_index=column_indices['temperature'])
baseline.compile(loss=tf.keras.losses.MeanSquaredError(),
metrics=[tf.keras.metrics.MeanAbsoluteError()])
val_performance = {}
performance = {}
val_performance['Baseline'] = baseline.evaluate(one_step_window.val, return_dict=True)
performance['Baseline'] = baseline.evaluate(one_step_window.test, verbose=0, return_dict=True)
1234/1234 [==============================] - 7s 4ms/step - loss: 0.0014 - mean_absolute_error: 0.0022
spread_window = WindowGenerator(
input_width=24, label_width=24, shift=1,
label_columns=['temperature'])
spread_window
Total window size: 25 Input indices: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23] Label indices: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] Label column name(s): ['temperature']
print('Input shape:', spread_window.example[0].shape)
print('Output shape:', baseline(spread_window.example[0]).shape)
Input shape: (32, 24, 5) Output shape: (32, 24, 1)
spread_window.plot(baseline) # labels shifted right by one hour
Linear model¶
- The output from a time step only depends on that step
Idea¶
- Create instances of the WindowGenerator with different configs
- Single-step vs wide-window predictions
- Used to train models & make predictions over different time scales
# Background:
# - A tf.keras.layers.Dense layer with no activation set is a linear model
# - The layer only transforms the last axis of the data from (batch, time, inputs) to (batch, time, units)
# - It is applied independently to every item across the batch and time axes
# Linear model
linear = tf.keras.Sequential([
tf.keras.layers.Dense(units=1)
])
print('Input shape:', one_step_window.example[0].shape)
print('Output shape:', linear(one_step_window.example[0]).shape)
Input shape: (32, 1, 5) Output shape: (32, 1, 1)
MAX_EPOCHS = 20
# Early stopping: stop training when val_loss stops improving
# patience=2: train until validation loss fails to improve for 2 epochs
def compile_and_fit(model, window, patience=2):
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
patience=patience,
mode='min')
model.compile(loss=tf.keras.losses.MeanSquaredError(),
optimizer=tf.keras.optimizers.Adam(),
metrics=[tf.keras.metrics.MeanAbsoluteError()])
history = model.fit(window.train, epochs=MAX_EPOCHS,
validation_data=window.val,
callbacks=[early_stopping])
return history
history = compile_and_fit(linear, one_step_window)
val_performance['Linear'] = linear.evaluate(one_step_window.val, return_dict=True)
performance['Linear'] = linear.evaluate(one_step_window.test, verbose=0, return_dict=True)
print('Shape Input:', one_step_window.example[0].shape)
print('Shape Output:', linear(one_step_window.example[0]).shape)
Epoch 1/20 4318/4318 [==============================] - 35s 8ms/step - loss: 0.6268 - mean_absolute_error: 0.4062 - val_loss: 0.0055 - val_mean_absolute_error: 0.0514 Epoch 2/20 4318/4318 [==============================] - 29s 7ms/step - loss: 0.0035 - mean_absolute_error: 0.0162 - val_loss: 0.0014 - val_mean_absolute_error: 0.0048 Epoch 3/20 4318/4318 [==============================] - 34s 8ms/step - loss: 0.0030 - mean_absolute_error: 0.0073 - val_loss: 0.0014 - val_mean_absolute_error: 0.0051 Epoch 4/20 4318/4318 [==============================] - 33s 8ms/step - loss: 0.0030 - mean_absolute_error: 0.0075 - val_loss: 0.0015 - val_mean_absolute_error: 0.0056 1234/1234 [==============================] - 8s 6ms/step - loss: 0.0015 - mean_absolute_error: 0.0056 Shape Input: (32, 1, 5) Shape Output: (32, 1, 1)
spread_window.plot(linear)
# In many cases the prediction is clearly better than just returning the input temperature, but in a few cases it's worse
plt.bar(x = range(len(train_df.columns)),
height=linear.layers[0].kernel[:,0].numpy())
axis = plt.gca()
axis.set_xticks(range(len(train_df.columns)))
_ = axis.set_xticklabels(train_df.columns, rotation=90)
# Model places most weight on temperature
Like the baseline model, the linear model can be called on batches of wide windows.¶
- The model makes a set of independent predictions on consecutive time steps
- The time axis acts like another batch axis
- There are no interactions between the predictions at each time step
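The "time axis acts like another batch axis" point can be illustrated with a minimal NumPy sketch: a Dense layer with no activation is just a per-time-step matrix multiply (the weights here are random stand-ins, not the trained model's):

```python
import numpy as np

rng = np.random.default_rng(42)
batch, time, features = 2, 6, 5
x = rng.normal(size=(batch, time, features))
w = rng.normal(size=(features, 1))  # like Dense(units=1), no activation

y = x @ w  # applied independently at every (batch, time) position

# Permuting the time steps permutes the predictions identically:
# no time step's prediction depends on any other time step
perm = np.array([3, 1, 5, 0, 4, 2])
assert np.allclose(x[:, perm, :] @ w, y[:, perm, :])
print(y.shape)  # (2, 6, 1)
```

This is why the wide-window plots show independent per-step predictions with no interaction between them.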
Overview¶
- Blue inputs (line) show the input temperature at each time step
- Green labels (dots) show the target prediction values, shifted one step forward
- Orange predictions (crosses) are the model's predictions for each output time step; perfect predictions would land on the labels (dots)
# Take test batch
for inputs, labels in spread_window.test.take(1):
predictions = linear.predict(inputs)
# df for predictions and actual values
predictions_df = pd.DataFrame(data={
"Predictions": predictions.flatten(),
"Actual Values": labels.numpy().flatten()
})
# Output
print("Predictions vs Actual Values:")
print(predictions_df.head()) # Shows the first few rows
# Find model weights in df
weights = linear.layers[0].kernel[:,0].numpy()
weights_df = pd.DataFrame(data={
"Feature": train_df.columns,
"Weight": weights
})
# Output weights
print("\nModel Weights:")
print(weights_df)
print("\nDescriptive Statistics for Predictions:")
print(predictions_df.describe())
1/1 [==============================] - 0s 104ms/step
Predictions vs Actual Values:
Predictions Actual Values
0 1.175895 1.177152
1 1.171215 1.177152
2 1.178890 1.177152
3 1.184662 1.177152
4 1.177986 1.177152
Model Weights:
Feature Weight
0 co2 0.000378
1 vdd 0.001319
2 humidity 0.003004
3 temperature 0.997548
4 soil_value 0.003017
Descriptive Statistics for Predictions:
Predictions Actual Values
count 768.000000 768.000000
mean 0.631839 0.631650
std 0.554347 0.555779
min -0.503021 -0.621616
25% 0.220847 0.215020
50% 0.635616 0.633338
75% 1.012649 1.009825
max 1.724331 1.720966
Multi-step Dense Model - Optimise Model Test¶
- A single-time-step model has no context for its input values.
- Test multiple time steps when making predictions, since the single-step model has no knowledge of how input features change over time.
dense = tf.keras.Sequential([
tf.keras.layers.Dense(units=64, activation='relu'),
tf.keras.layers.Dense(units=64, activation='relu'),
tf.keras.layers.Dense(units=1)
])
history = compile_and_fit(dense, one_step_window)
val_performance['Dense'] = dense.evaluate(one_step_window.val, return_dict=True)
performance['Dense'] = dense.evaluate(one_step_window.test, verbose=0, return_dict=True)
Epoch 1/20 4318/4318 [==============================] - 39s 9ms/step - loss: 0.0069 - mean_absolute_error: 0.0283 - val_loss: 0.0017 - val_mean_absolute_error: 0.0136 Epoch 2/20 4318/4318 [==============================] - 48s 11ms/step - loss: 0.0035 - mean_absolute_error: 0.0185 - val_loss: 0.0022 - val_mean_absolute_error: 0.0239 Epoch 3/20 4318/4318 [==============================] - 51s 12ms/step - loss: 0.0034 - mean_absolute_error: 0.0164 - val_loss: 0.0018 - val_mean_absolute_error: 0.0169 1234/1234 [==============================] - 8s 6ms/step - loss: 0.0018 - mean_absolute_error: 0.0169
CONV_WIDTH = 3
conv_window = WindowGenerator(
input_width=CONV_WIDTH,
label_width=1,
shift=1,
label_columns=['temperature'])
conv_window
Total window size: 4 Input indices: [0 1 2] Label indices: [3] Label column name(s): ['temperature']
conv_window.plot()
plt.suptitle("3 hours of inputs = predict 1 hour")
Text(0.5, 0.98, '3 hours of inputs = predict 1 hour')
Train a Dense model by flattening the first layer¶
- tf.keras.layers.Flatten
multi_step_dense = tf.keras.Sequential([
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(units=32, activation='relu'),
tf.keras.layers.Dense(units=32, activation='relu'),
tf.keras.layers.Dense(units=1),
tf.keras.layers.Reshape([1, -1]),
])
history = compile_and_fit(multi_step_dense, conv_window)
IPython.display.clear_output()
val_performance['Multi step dense'] = multi_step_dense.evaluate(conv_window.val, return_dict=True)
performance['Multi step dense'] = multi_step_dense.evaluate(conv_window.test, verbose=0, return_dict=True)
1234/1234 [==============================] - 4s 3ms/step - loss: 0.0015 - mean_absolute_error: 0.0098
conv_window.plot(multi_step_dense)
print('Input shape:', spread_window.example[0].shape)
try:
print('Output shape:', multi_step_dense(spread_window.example[0]).shape)
except Exception as e:
print(f'\n{type(e).__name__}:{e}')
# Only shows one shape
Input shape: (32, 24, 5) ValueError:Exception encountered when calling layer 'sequential_3' (type Sequential). Input 0 of layer "dense_7" is incompatible with the layer: expected axis -1 of input shape to have value 15, but received input with shape (32, 120) Call arguments received by layer 'sequential_3' (type Sequential): • inputs=tf.Tensor(shape=(32, 24, 5), dtype=float32) • training=None • mask=None
# convolution layer (tf.keras.layers.Conv1D) also takes multiple time steps as input to each prediction.
conv_model = tf.keras.Sequential([
tf.keras.layers.Conv1D(filters=32,
kernel_size=(CONV_WIDTH,),
activation='relu'),
tf.keras.layers.Dense(units=32, activation='relu'),
tf.keras.layers.Dense(units=1),
])
# Test model produces output with expected shape
print("Conv model on `conv_window`")
print('Input shape:', conv_window.example[0].shape)
print('Output shape:', conv_model(conv_window.example[0]).shape)
Conv model on `conv_window` Input shape: (32, 3, 5) Output shape: (32, 1, 1)
history = compile_and_fit(conv_model, conv_window)
IPython.display.clear_output()
val_performance['Conv'] = conv_model.evaluate(conv_window.val, return_dict=True)
performance['Conv'] = conv_model.evaluate(conv_window.test, verbose=0, return_dict=True)
1234/1234 [==============================] - 5s 4ms/step - loss: 0.0015 - mean_absolute_error: 0.0085
# Label and prediction require same length, increase input time steps to enable lengths to match
LABEL_WIDTH = 24
INPUT_WIDTH = LABEL_WIDTH + (CONV_WIDTH - 1)
wide_conv_window = WindowGenerator(
input_width=INPUT_WIDTH,
label_width=LABEL_WIDTH,
shift=1,
label_columns=['temperature'])
wide_conv_window
print("Wide conv window")
print('Input shape:', wide_conv_window.example[0].shape)
print('Labels shape:', wide_conv_window.example[1].shape)
print('Output shape:', conv_model(wide_conv_window.example[0]).shape)
Wide conv window Input shape: (32, 26, 5) Labels shape: (32, 24, 1) Output shape: (32, 24, 1)
wide_conv_window.plot(conv_model)
val_performance
{'Baseline': {'loss': 0.001438622479327023,
'mean_absolute_error': 0.0022306805476546288},
'Linear': {'loss': 0.001455001882277429,
'mean_absolute_error': 0.005563145503401756},
'Dense': {'loss': 0.0017582725267857313,
'mean_absolute_error': 0.016851268708705902},
'Multi step dense': {'loss': 0.0015167850069701672,
'mean_absolute_error': 0.00977565161883831},
'Conv': {'loss': 0.0015202785143628716,
'mean_absolute_error': 0.008469038642942905}}
x = np.arange(len(performance))
width = 0.3
metric_name = 'mean_absolute_error'
val_mae = [v[metric_name] for v in val_performance.values()]
test_mae = [v[metric_name] for v in performance.values()]
plt.ylabel('mean_absolute_error [T (degC), normalized]')
plt.bar(x - 0.17, val_mae, width, label='Validation')
plt.bar(x + 0.17, test_mae, width, label='Test')
plt.xticks(ticks=x, labels=performance.keys(),
rotation=45)
_ = plt.legend()
for name, value in performance.items():
print(f'{name:12s}: {value[metric_name]:0.4f}')
Baseline : 0.0028 Linear : 0.0060 Dense : 0.0192 Multi step dense: 0.0149 Conv : 0.0107
4. Data Quality Report¶
a. API Endpoints Called & Data Cleaning & Pre-processing
- Convert to datetime
- Combine datasets to identify ('site_id' check merge)
- Clean (normalise) the data
- Address missing values
- Check data types
- Locate/fix dataset-specific issues
Datasets (2+ million records):
- soil-sensor-readings-historical-data: 2,400,683 records
- soil-sensor-locations: 90 records
- microclimate-sensor-readings: 56 records
- microlabs-co2-sensor-data: 207,523 records
- microclimate-sensor-locations: 16 records
- LOCF (Last Observation Carried Forward) for missing values & interquartile range (IQR) for outliers
- Correlation heatmap (volumetric water content, temperature and salinity)
- Explored time series/trends for all data (including over 70 different types of soil data)
- Folium library to map sensor locations
- User interfaces & merged datasets into a single dataset using datetime
- **Normalisation & descriptive statistics**: temperature, humidity, light, vdd, co2 and soil value (30cm)
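A minimal pandas sketch of the LOCF and IQR steps described above (the sample values and column name are illustrative, not the notebook's actual cleaning code):

```python
import pandas as pd

# Hypothetical soil readings with a gap and one outlier
s = pd.Series([34.6, 35.1, None, None, 36.0, 120.0, 35.5], name="soil_value")

# LOCF: carry the last observation forward into the gaps
s_filled = s.ffill()

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s_filled.quantile(0.25), s_filled.quantile(0.75)
iqr = q3 - q1
outliers = s_filled[(s_filled < q1 - 1.5 * iqr) | (s_filled > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [120.0] - only the spike is flagged
```

LOCF suits sensor streams because readings change slowly between samples; the IQR fence then catches spikes without assuming a normal distribution.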
- PCA (Principal Component Analysis) on soil moisture measurements at different depths
- Pearson and Spearman correlations: show a weak but statistically significant relationship between temperature and CO2 levels
- Granger Causality Test: temperature was found to be a significant predictor of soil value, humidity and light; CO2 predicts motion and vdd
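The idea behind the Granger causality result can be sketched in plain NumPy: compare a restricted autoregressive model of soil value against one augmented with lagged temperature, via an F-test on the residual sums of squares (synthetic lag-1 data for illustration, not the notebook's actual test):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Synthetic data where temperature drives soil value with a one-step lag
temp = rng.normal(size=n)
soil = np.zeros(n)
for t in range(1, n):
    soil[t] = 0.5 * soil[t - 1] + 0.8 * temp[t - 1] + 0.1 * rng.normal()

y = soil[1:]
# Restricted model: soil explained only by its own lag
X_r = np.column_stack([np.ones(n - 1), soil[:-1]])
# Unrestricted model: add one lag of temperature
X_u = np.column_stack([X_r, temp[:-1]])

def rss(X):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - X @ beta) ** 2)

# F-statistic for adding one restriction (the temperature lag)
F = (rss(X_r) - rss(X_u)) / (rss(X_u) / (len(y) - X_u.shape[1]))
print(F)  # large F => lagged temperature helps predict soil value
```

A large F relative to the F(1, n-k) critical value means the lagged predictor adds real explanatory power, which is the sense in which "temperature predicts soil value" here.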
b. Integrated Data Dashboard: soil type, depth, measure, and overlays for temperature, CO2, humidity, light and vdd.
c. **Linear Regression Model** makes a set of independent predictions on consecutive time steps. A **Fast Fourier Transform (FFT)** test confirmed temperature was a predictor (normally used for sound/wavelengths, but applicable to temperature variance metrics). **Violin plots** to see the distributions after normalisation. Predictions are based on a *window of consecutive samples* from the data, using the TensorFlow *tf.keras* model API, which has built-in functions for deep learning models and is useful for making time predictions. Created a base model to predict the next value in the time series data (input window). Created a linear model and examined single-step and wide/spread window predictions. Tested dense, multi-step and convolution layer models, **however linear regression with weights based on temperature outperformed the other models**. The linear model showed high accuracy and consistency across predictions (mean ~0.63, matching the base model) with non-significant deviations of ~0.01 in predictions.